Sklearn SGDClassifier partial fit

I have finally found the answer. You need to shuffle the training data between iterations yourself: setting shuffle=True when instantiating the model will NOT shuffle the data when using partial_fit (it only applies to fit). Note: it would have been helpful to find this information on the sklearn.linear_model.SGDClassifier documentation page.

The amended code reads as follows:

from sklearn.linear_model import SGDClassifier
import numpy
import random

clf2 = SGDClassifier(loss='log')  # shuffle=True has no effect on partial_fit, so it is omitted
classes = numpy.unique(Y)         # all class labels must be passed to every partial_fit call
shuffledRange = list(range(len(X)))
n_iter = 5
for n in range(n_iter):
    random.shuffle(shuffledRange)              # reshuffle the sample order before every pass
    shuffledX = [X[i] for i in shuffledRange]
    shuffledY = [Y[i] for i in shuffledRange]
    for batch in batches(range(len(shuffledX)), 10000):  # batches() is the generator defined in the question
        clf2.partial_fit(shuffledX[batch[0]:batch[-1]+1], shuffledY[batch[0]:batch[-1]+1], classes=classes)

Comments

  • David M.
    David M. almost 4 years

    I'm trying to use SGD to classify a large dataset. As the data is too large to fit into memory, I'd like to use the partial_fit method to train the classifier. I have selected a sample of the dataset (100,000 rows) that fits into memory to test fit vs. partial_fit:

    from sklearn.linear_model import SGDClassifier
    import numpy
    
    def batches(l, n):
        # yield successive chunks of n indexes
        for i in range(0, len(l), n):
            yield l[i:i+n]
    
    clf1 = SGDClassifier(shuffle=True, loss='log')
    clf1.fit(X, Y)
    
    clf2 = SGDClassifier(shuffle=True, loss='log')
    n_iter = 60
    for n in range(n_iter):
        for batch in batches(range(len(X)), 10000):
            clf2.partial_fit(X[batch[0]:batch[-1]+1], Y[batch[0]:batch[-1]+1], classes=numpy.unique(Y))
    

    I then test both classifiers with an identical test set. In the first case I get an accuracy of 100%. As I understand it, SGD by default passes 5 times over the training data (n_iter = 5).

    In the second case, I have to pass 60 times over the data to reach the same accuracy.

    Why this difference (5 vs. 60)? Or am I doing something wrong?

    • Fred Foo
      Fred Foo almost 10 years
      Give verbose=1 to the SGD constructor, that may give you a hint.
    • David M.
      David M. almost 10 years
      First case (fit) ends with "-- Epoch 5 Norm: 29.25, NNZs: 300, Bias: -1.674706, T: 459595, Avg. loss: 0.076786". Second case (partial_fit) after 10 passes ends with "-- Epoch 1 Norm: 22.99, NNZs: 300, Bias: -1.999685, T: 1918, Avg. loss: 0.089302". What should I be looking for? thx
    • Fred Foo
      Fred Foo almost 10 years
      The average loss. Check if it drops faster in the batch case.
    • David M.
      David M. almost 10 years
      In the first case it drops from 0.087027 to 0.076786 in 15 passes (5 epochs; 3 passes/epoch). In the second case it's difficult to tell because it seems to me that the avg loss figures relate to each individual batch; hence great variations in the numbers (e.g. the last 10 figures are 0.000748; 0.258055; 0.001160; 0.267540; 0.036631; 0.291704; 0.197599; 0.012074; 0.109227; 0.089302).
  • Fabio Picchi
    Fabio Picchi over 6 years
    Shuffling the whole dataset would not be possible as the data does not fit in memory (if it did, we could simply use fit). Does shuffling the data inside the batches yield better results?
  • skeller88
    skeller88 over 4 years
    If your data is stored in a way that supports indexed access (for example, a list of filenames mapping to files, or indexes into an on-disk array), you can keep the indexes separate from the data and shuffle only the indexes between epochs (see the sketch after these comments).
  • Thariq Nugrohotomo
    Thariq Nugrohotomo over 2 years
    Shuffling after each iteration is mentioned in the current version of the user guide, scikit-learn.org/stable/modules/sgd.html. I remember the shuffling being mentioned in Andrew Ng's YouTube lectures too.
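
A minimal sketch of the index-shuffling approach skeller88 describes, assuming the samples are stored one row per .npy file and that a hypothetical load_rows() helper reads only the requested rows into memory; the file layout, load_rows(), and labels.npy are illustrative assumptions, not part of the original answer.

import numpy as np
from sklearn.linear_model import SGDClassifier

labels = np.load('labels.npy')        # assumed: labels are small enough to keep in memory
all_classes = np.unique(labels)
indices = np.arange(len(labels))      # only the indexes live in memory, never the full dataset

def load_rows(batch_idx):
    # Hypothetical helper: reads only the requested rows from disk.
    # Adapt to your storage (one .npy file per row, np.load(..., mmap_mode='r'), a database, etc.).
    X_batch = np.array([np.load('row_%d.npy' % i) for i in batch_idx])
    y_batch = labels[batch_idx]
    return X_batch, y_batch

# loss='log' matches the question's code; newer scikit-learn versions call this 'log_loss'
clf = SGDClassifier(loss='log')
n_epochs = 5
batch_size = 10000

for epoch in range(n_epochs):
    np.random.shuffle(indices)        # shuffle the index order, not the data itself
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        X_batch, y_batch = load_rows(batch_idx)
        clf.partial_fit(X_batch, y_batch, classes=all_classes)

Only the index array is ever shuffled in memory, so each epoch costs a cheap in-place shuffle plus sequential reads of one batch at a time.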