Memory efficient way to split large numpy array into train and test


Solution 1

One method that I've tried which works is to store X in a pandas dataframe and shuffle

X = X.reindex(np.random.permutation(X.index))

since I arrive at the same memory error when I try

np.random.shuffle(X)

Then I convert the pandas dataframe back to a numpy array and, using this function, I can obtain a train/test split

# test_proportion of 3 means 1/3, i.e. 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    # split the already-shuffled arrays: the first `ratio` rows become the test set
    ratio = matrix.shape[0] // test_proportion  # integer number of test rows
    X_train = matrix[ratio:, :]
    X_test = matrix[:ratio, :]
    Y_train = target[ratio:]  # works for 1-D and 2-D targets
    Y_test = target[:ratio]
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = shuffle(X, Y, 3)

This works for now, and when I want to do k-fold cross-validation I can loop k times and reshuffle the pandas dataframe. While this suffices, why do numpy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
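
For reference, a rough sketch of that reshuffle-and-split loop, reusing the shuffle helper above (the fold count k and the fit/evaluate step are placeholders, not part of the original post):

# Sketch of the k-fold idea described above: reshuffle the dataframe each
# iteration with a fresh permutation, take a new 1/3 split, and train on it.
import numpy as np
import pandas as pd

k = 3  # number of reshuffled splits (an assumed value)
X_df, Y_df = pd.DataFrame(X), pd.DataFrame(Y)  # X, Y are the original arrays

for fold in range(k):
    perm = np.random.permutation(X_df.index)   # one permutation for both
    X_shuf = X_df.reindex(perm).to_numpy()
    Y_shuf = Y_df.reindex(perm).to_numpy()
    X_train, X_test, Y_train, Y_test = shuffle(X_shuf, Y_shuf, 3)
    # ... fit and evaluate the model on this split ...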

Solution 2

Another way to use the sklearn split method with reduced memory usage is to generate an index vector for X and split on this vector. Afterwards you can select your entries and, e.g., write the training and test splits to disk.

import h5py
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20

X = np.random.random((10000,70000))
Y = np.random.random((10000,))

x_ids = list(range(len(X)))
x_train_ids, x_test_ids, Y_train, Y_test = train_test_split(x_ids, Y, test_size = 0.33, random_state=42)

# Write

f = h5py.File('dataset/train.h5py', 'w')
f.create_dataset("inputs", data=X[x_train_ids])  # keep the original float dtype
f.create_dataset("labels", data=Y_train)
f.close()

f = h5py.File('dataset/test.h5py', 'w')
f.create_dataset("inputs", data=X[x_test_ids])
f.create_dataset("labels", data=Y_test)
f.close()

# Read

f = h5py.File('dataset/train.h5py', 'r')
X_train = np.array(f.get('inputs'))
Y_train = np.array(f.get('labels'))
f.close()

f = h5py.File('dataset/test.h5py', 'r')
X_test = np.array(f.get('inputs'))
Y_test = np.array(f.get('labels'))
f.close()
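
If the goal is to keep memory usage down at read time as well, note that an h5py dataset can be sliced directly from disk, so a model that learns in batches never needs the full array in RAM. A minimal sketch (the batch size is arbitrary):

# Stream training batches from the HDF5 file instead of loading it all at once.
import h5py

with h5py.File('dataset/train.h5py', 'r') as f:
    inputs = f['inputs']    # h5py Dataset object; the data stays on disk
    labels = f['labels']
    batch_size = 256
    for start in range(0, inputs.shape[0], batch_size):
        X_batch = inputs[start:start + batch_size]   # only this slice is read
        Y_batch = labels[start:start + batch_size]
        # ... feed X_batch, Y_batch to a model that supports partial_fit ...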

Solution 3

I came across a similar problem.

As mentioned by @user1879926, I think the shuffle is the main cause of memory exhaustion.

And, although 'shuffle' was reported to be an invalid parameter for model_selection.train_test_split in the question cited, train_test_split in sklearn 0.19 has an option to disable shuffling.

So, I think you can escape the memory error by just adding the shuffle=False option.
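
A minimal example of that option, with the array sizes from the question (random_state is irrelevant when shuffle=False):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.random((10000, 70000))
Y = np.random.random((10000,))

# shuffle=False splits the arrays in their existing order instead of permuting first
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, shuffle=False)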

Solution 4

I faced the same problem with my code. I was using a dense array like you and ran out of memory. I converted my training data to a sparse matrix (I am doing document classification), and that solved my issue.
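
A sketch of that idea, assuming the feature matrix is mostly zeros (as is typical for bag-of-words document data; vectorizers such as CountVectorizer already return CSR output). train_test_split accepts scipy sparse matrices directly:

import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# toy stand-in for a sparse document-term matrix (0.1% non-zero entries)
X_sparse = sparse.random(10000, 70000, density=0.001, format='csr', random_state=42)
Y = np.random.random((10000,))

X_train, X_test, Y_train, Y_test = train_test_split(X_sparse, Y, test_size=0.33, random_state=42)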



Comments

  • user1879926 over 1 year

    I have a large numpy array, and when I run scikit-learn's train_test_split to split the array into training and test data, I always run into memory errors. What would be a more memory-efficient method of splitting into train and test, and why does train_test_split cause this?

    The following code results in a memory error and causes a crash

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.random((10000,70000))
    Y = np.random.random((10000,))
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state=42)
    
    • wwii almost 9 years
    • eickenberg almost 9 years
      works for me on a 64G machine, had big problems on an 8G laptop (it would probably have led to a memory error if I hadn't killed it). The issue is most probably that the train/test split inevitably makes copies of the data, because it uses fancy indexing, whereas in a situation without randomization, e.g. KFold, this could be avoided (but you would have to code the split yourself, because sklearn's KFold also copies). If you need randomization, you could consider shuffling the data in place first (a sketch of that in-place approach follows these comments).
  • Bruno Feroleto almost 9 years
    An array of 10,000 x 70,000 NumPy floats has 700 million elements, where each element takes 8 bytes, so this array uses about 5.6 GB of memory. This is actually sizable.
  • DMML almost 9 years
    I suppose size is all relative -- in personal computer terms, definitely sizable. In HPC terms, not so much.
  • user1879926 almost 9 years
    Does the code snippet in my question work for any one of you?
  • DMML almost 9 years
    @user1879926 Yes. On a machine with 48 GB of memory. Which is why I was asking what machine you were running.
  • user1879926 almost 9 years
    My Macbook has 16 GB of RAM and about 500 GB of free disk space.
  • Austin over 3 years
    If your model can learn in batches from a generator, this method is also great for getting splits from sklearn (and this works with stratification, too). Instead of the list of indices, you can also create a list of paths pointing to your files. You wouldn't need the writing and reading in that case.
  • madprogramer over 3 years
    This deserves to be the accepted answer! No need for numpy necromancy.
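
For reference, a minimal sketch of the in-place approach eickenberg's comment describes, assuming X and Y fit in memory once: shuffle both arrays in place with the same random state so their rows stay aligned, then split with basic slicing, which returns views rather than copies:

import numpy as np

X = np.random.random((10000, 70000))
Y = np.random.random((10000,))

state = np.random.get_state()
np.random.shuffle(X)          # shuffles the rows of X in place
np.random.set_state(state)
np.random.shuffle(Y)          # same permutation applied to Y

split = int(len(X) * 0.67)    # 67% train / 33% test, as in the question
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]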