Memory efficient way to split large numpy array into train and test


Solution 1

One method that I've tried which works is to store X in a pandas dataframe and shuffle

X = X.reindex(np.random.permutation(X.index))

since I arrive at the same memory error when I try

np.random.shuffle(X)

Then I convert the pandas dataframe back to a numpy array and, using this function, I can obtain a train/test split

# test_proportion of 3 means 1/3, i.e. 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    # split the already-shuffled arrays: the first `ratio` rows become the test set
    ratio = matrix.shape[0] // test_proportion  # integer number of test rows
    X_train = matrix[ratio:, :]
    X_test = matrix[:ratio, :]
    Y_train = target[ratio:]  # works for 1-D and 2-D targets
    Y_test = target[:ratio]
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = shuffle(X, Y, 3)

This works for now, and when I want to do k-fold cross-validation I can loop k times and reshuffle the pandas dataframe. While this suffices, why do numpy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
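
For reference, a rough sketch of that reshuffle-and-split loop, reusing the shuffle helper above (the fold count k and the fit/evaluate step are placeholders, not part of the original post):

# Sketch of the k-fold idea described above: reshuffle the dataframe each
# iteration with a fresh permutation, take a new 1/3 split, and train on it.
import numpy as np
import pandas as pd

k = 3  # number of reshuffled splits (an assumed value)
X_df, Y_df = pd.DataFrame(X), pd.DataFrame(Y)  # X, Y are the original arrays

for fold in range(k):
    perm = np.random.permutation(X_df.index)   # one permutation for both
    X_shuf = X_df.reindex(perm).to_numpy()
    Y_shuf = Y_df.reindex(perm).to_numpy()
    X_train, X_test, Y_train, Y_test = shuffle(X_shuf, Y_shuf, 3)
    # ... fit and evaluate the model on this split ...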

Solution 2

Another way to use the sklearn split method with reduced memory usage is to generate an index vector for X and split on this vector. Afterwards you can select your entries and, e.g., write the training and test splits to disk.

import h5py
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20

X = np.random.random((10000,70000))
Y = np.random.random((10000,))

x_ids = list(range(len(X)))
x_train_ids, x_test_ids, Y_train, Y_test = train_test_split(x_ids, Y, test_size = 0.33, random_state=42)

# Write

f = h5py.File('dataset/train.h5py', 'w')
f.create_dataset("inputs", data=X[x_train_ids])  # keep the original float dtype
f.create_dataset("labels", data=Y_train)
f.close()

f = h5py.File('dataset/test.h5py', 'w')
f.create_dataset("inputs", data=X[x_test_ids])
f.create_dataset("labels", data=Y_test)
f.close()

# Read

f = h5py.File('dataset/train.h5py', 'r')
X_train = np.array(f.get('inputs'))
Y_train = np.array(f.get('labels'))
f.close()

f = h5py.File('dataset/test.h5py', 'r')
X_test = np.array(f.get('inputs'))
Y_test = np.array(f.get('labels'))
f.close()
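
If the goal is to keep memory usage down at read time as well, note that an h5py dataset can be sliced directly from disk, so a model that learns in batches never needs the full array in RAM. A minimal sketch (the batch size is arbitrary):

# Stream training batches from the HDF5 file instead of loading it all at once.
import h5py

with h5py.File('dataset/train.h5py', 'r') as f:
    inputs = f['inputs']    # h5py Dataset object; the data stays on disk
    labels = f['labels']
    batch_size = 256
    for start in range(0, inputs.shape[0], batch_size):
        X_batch = inputs[start:start + batch_size]   # only this slice is read
        Y_batch = labels[start:start + batch_size]
        # ... feed X_batch, Y_batch to a model that supports partial_fit ...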

Solution 3

I came across a similar problem.

As mentioned by @user1879926, I think the shuffle is the main cause of memory exhaustion.

And, although 'shuffle' was reported to be an invalid parameter for model_selection.train_test_split in the question cited, train_test_split in sklearn 0.19 has an option to disable shuffling.

So, I think you can escape the memory error by just adding the shuffle=False option.
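
A minimal example of that option, with the array sizes from the question (random_state is irrelevant when shuffle=False):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.random((10000, 70000))
Y = np.random.random((10000,))

# shuffle=False splits the arrays in their existing order instead of permuting first
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, shuffle=False)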

Solution 4

I faced the same problem with my code. I was using a dense array like you and ran out of memory. I converted my training data to a sparse matrix (I am doing document classification), and that solved my issue.
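
A sketch of that idea, assuming the feature matrix is mostly zeros (as is typical for bag-of-words document data; vectorizers such as CountVectorizer already return CSR output). train_test_split accepts scipy sparse matrices directly:

import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# toy stand-in for a sparse document-term matrix (0.1% non-zero entries)
X_sparse = sparse.random(10000, 70000, density=0.001, format='csr', random_state=42)
Y = np.random.random((10000,))

X_train, X_test, Y_train, Y_test = train_test_split(X_sparse, Y, test_size=0.33, random_state=42)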



Comments

  • user1879926 over 1 year

    I have a large numpy array, and when I run scikit-learn's train_test_split to split the array into training and test data, I always run into memory errors. What would be a more memory-efficient method of splitting into train and test, and why does train_test_split cause this?

    The following code results in a memory error and causes a crash

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.random((10000,70000))
    Y = np.random.random((10000,))
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state=42)
    
    • wwii almost 9 years
    • eickenberg almost 9 years
      works for me on a 64G machine, had big problems on an 8G laptop (it would probably have led to a memory error if I hadn't killed it). The issue is most probably that the train/test split inevitably makes copies of the data, because it uses fancy indexing, whereas in a situation without randomization, e.g. KFold, this could be avoided (but you would have to code the split yourself, because sklearn's KFold also copies). If you need randomization, you could consider shuffling the data in place first (a sketch of that in-place approach follows these comments).
  • Bruno Feroleto almost 9 years
    An array of 10,000 x 70,000 NumPy floats has 700 million elements, where each element takes 8 bytes, so this array uses about 5.6 GB of memory. This is actually sizable.
  • DMML almost 9 years
    I suppose size is all relative -- in personal computer terms, definitely sizable. In HPC terms, not so much.
  • user1879926 almost 9 years
    Does the code snippet in my question work for any one of you?
  • DMML almost 9 years
    @user1879926 Yes. On a machine with 48 GB of memory. Which is why I was asking what machine you were running.
  • user1879926 almost 9 years
    My Macbook has 16 GB of RAM and about 500 GB of free disk space.
  • Austin over 3 years
    If your model can learn in batches from a generator, this method is also great for getting splits from sklearn (and this works with stratification, too). Instead of the list of indices, you can also create a list of paths pointing to your files. You wouldn't need the writing and reading in that case.
  • madprogramer over 3 years
    This deserves to be the accepted answer! No need for numpy necromancy.
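
For reference, a minimal sketch of the in-place approach eickenberg's comment describes, assuming X and Y fit in memory once: shuffle both arrays in place with the same random state so their rows stay aligned, then split with basic slicing, which returns views rather than copies:

import numpy as np

X = np.random.random((10000, 70000))
Y = np.random.random((10000,))

state = np.random.get_state()
np.random.shuffle(X)          # shuffles the rows of X in place
np.random.set_state(state)
np.random.shuffle(Y)          # same permutation applied to Y

split = int(len(X) * 0.67)    # 67% train / 33% test, as in the question
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]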