Sklearn SGDClassifier partial fit
I have finally found the answer: you need to shuffle the training data between epochs yourself, because setting shuffle=True when instantiating the model will NOT shuffle the data when using partial_fit (it only applies to fit). Note: it would have been helpful to find this information on the sklearn.linear_model.SGDClassifier documentation page.
The amended code reads as follows:
from sklearn.linear_model import SGDClassifier
import numpy
import random

def batches(l, n):  # same helper as in the question
    for i in range(0, len(l), n):
        yield l[i:i+n]

clf2 = SGDClassifier(loss='log')  # shuffle=True is useless here
shuffledRange = list(range(len(X)))  # wrap in list() so it can be shuffled (Python 3)
n_iter = 5
classes = numpy.unique(Y)  # compute once instead of in every partial_fit call
for n in range(n_iter):
    random.shuffle(shuffledRange)
    shuffledX = [X[i] for i in shuffledRange]
    shuffledY = [Y[i] for i in shuffledRange]
    for batch in batches(range(len(shuffledX)), 10000):
        clf2.partial_fit(shuffledX[batch[0]:batch[-1]+1],
                         shuffledY[batch[0]:batch[-1]+1],
                         classes=classes)
David M.
Updated on July 15, 2020

Comments
-
David M. almost 4 years
I'm trying to use SGD to classify a large dataset. As the data is too large to fit into memory, I'd like to use the partial_fit method to train the classifier. I have selected a sample of the dataset (100,000 rows) that fits into memory to test fit vs. partial_fit:
from sklearn.linear_model import SGDClassifier

def batches(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

clf1 = SGDClassifier(shuffle=True, loss='log')
clf1.fit(X, Y)

clf2 = SGDClassifier(shuffle=True, loss='log')
n_iter = 60
for n in range(n_iter):
    for batch in batches(range(len(X)), 10000):
        clf2.partial_fit(X[batch[0]:batch[-1]+1], Y[batch[0]:batch[-1]+1], classes=numpy.unique(Y))
I then test both classifiers with an identical test set. In the first case I get an accuracy of 100%. As I understand it, SGD by default passes 5 times over the training data (n_iter = 5).
In the second case, I have to pass 60 times over the data to reach the same accuracy.
Why this difference (5 vs. 60)? Or am I doing something wrong?
-
Fred Foo almost 10 years: Give verbose=1 to the SGD constructor; that may give you a hint. -
David M. almost 10 yearsFirst case (fit) ends with "-- Epoch 5 Norm: 29.25, NNZs: 300, Bias: -1.674706, T: 459595, Avg. loss: 0.076786". Second case (partial_fit) after 10 passes ends with "-- Epoch 1 Norm: 22.99, NNZs: 300, Bias: -1.999685, T: 1918, Avg. loss: 0.089302". What should I be looking for? thx
-
Fred Foo almost 10 yearsThe average loss. Check if it drops faster in the batch case.
-
David M. almost 10 yearsIn the first case it drops from 0.087027 to 0.076786 in 15 passes (5 epochs; 3 passes/epoch). In the second case it's difficult to tell because it seems to me that the avg loss figures relate to each individual batch; hence great variations in the numbers (e.g. the last 10 figures are 0.000748; 0.258055; 0.001160; 0.267540; 0.036631; 0.291704; 0.197599; 0.012074; 0.109227; 0.089302).
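Per-batch loss figures like the ones quoted above are noisy; a comparable per-epoch number can be obtained by averaging them, weighted by batch size. A minimal sketch (the helper name and the example figures reused from the comment are illustrative only):

```python
def epoch_average(batch_losses, batch_sizes):
    """Weighted average of per-batch losses, weights = batch sizes."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# e.g. a few of the per-batch figures from the comment above, equal-sized batches
losses = [0.000748, 0.258055, 0.001160, 0.267540, 0.036631]
sizes = [10000] * len(losses)
avg = epoch_average(losses, sizes)
```

With equal batch sizes this reduces to the plain mean; unequal final batches are what the weighting is for.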
-
Fabio Picchi over 6 yearsShuffling the whole dataset would not be possible as the data does not fit in memory (if it did, we could simply use fit). Does shuffling the data inside the batches yield better results?
-
skeller88 over 4 yearsIf your data is stored in a way that is compatible with indexing such as filenames to files, or indexes to locations in an array, then you can store the data indexes separate from the data, and shuffle the indexes between each epoch.
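The index-shuffling idea above can be sketched as follows: keep only the indexes in memory, reshuffle them each epoch, and yield index batches that a loader (not shown here; it would read the corresponding rows or files from disk) turns into actual data for partial_fit.

```python
import random

def epoch_batches(n_samples, batch_size, seed=None):
    """Yield shuffled index lists, one per batch, covering all samples once."""
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)  # shuffle the indexes, not the data
    for start in range(0, n_samples, batch_size):
        yield indexes[start:start + batch_size]

# per epoch: for idx in epoch_batches(N, 10000): clf.partial_fit(load_batch(idx), ...)
```

Here load_batch is a hypothetical out-of-core loader; only the index list (a few MB even for millions of rows) ever needs to fit in memory.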
-
Thariq Nugrohotomo over 2 years: It is mentioned in the current version of the user guide, scikit-learn.org/stable/modules/sgd.html ("shuffle after each iteration"). I remember the shuffling being mentioned in Andrew Ng's YouTube lectures too.