Why do fit and partial_fit of sklearn's LatentDirichletAllocation return different results?


They do not run exactly the same code: partial_fit uses total_samples:

" total_samples : int, optional (default=1e6) Total number of documents. Only used in the partial_fit method."

(total_samples docstring) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L184

(partial_fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L472

(fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L510
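In other words, total_samples is a constructor parameter that fit never reads; only partial_fit uses it to scale its online updates. A minimal illustration (the values below are placeholders, not recommendations):

```python
from sklearn.decomposition import LatentDirichletAllocation

# total_samples only matters for partial_fit; fit() never reads this attribute.
# n_components=10 and total_samples=50_000 are placeholder values.
lda = LatentDirichletAllocation(n_components=10, total_samples=50_000)
```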

In case it is of interest: partial_fit is a good candidate whenever your dataset is really big. Instead of running into memory problems, you perform the fitting in smaller batches; this is called incremental learning.
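As a rough sketch of that batching pattern (the toy count matrix X, the chunk size, and the number of topics are invented for the example; in practice X would come from something like CountVectorizer):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix standing in for a corpus too large to fit at once.
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(2000, 100))
n_docs = X.shape[0]
chunk_size = 500  # arbitrary chunk size for the example

lda = LatentDirichletAllocation(
    n_components=10,
    total_samples=n_docs,  # tell the online updates how big the full corpus is
    random_state=0,
)

# Incremental learning: feed the corpus to partial_fit one chunk at a time.
for start in range(0, n_docs, chunk_size):
    lda.partial_fit(X[start:start + chunk_size])

print(lda.components_.shape)  # (10, 100) topic-word matrix
```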

So, in your case you should take into account that the default value of total_samples is 1,000,000. If you don't change this number and your real number of samples is bigger, you will get different results from fit and partial_fit. It could also be that you are using mini-batches with partial_fit and not covering all the samples that you provide to fit. And even if you do all of this right, you can still get different results (see the comparison sketch after the documentation link below), as stated in the documentation:

  • "the incremental learner itself may be unable to cope with new/unseen targets classes. In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter."
  • "[...] choosing a proper algorithm is that all of them don’t put the same importance on each example over time [...]"

sklearn documentation: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning
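To make the point concrete, here is a hedged comparison sketch on toy data (all sizes and parameter values are invented): even with total_samples set to the real corpus size and every document passed through partial_fit, fit makes max_iter passes over the data (10 by default) while the loop below makes a single pass, so the learned topic-word matrices generally differ.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(1000, 50))  # toy document-term counts
n_docs = X.shape[0]

# Online variational Bayes over the whole corpus in one fit() call.
lda_fit = LatentDirichletAllocation(
    n_components=5, learning_method="online", batch_size=200, random_state=0
)
lda_fit.fit(X)

# Same settings, but total_samples must be set to the real corpus size,
# because partial_fit relies on it instead of looking at X.shape[0].
lda_partial = LatentDirichletAllocation(
    n_components=5, total_samples=n_docs, batch_size=200, random_state=0
)
for start in range(0, n_docs, 200):
    lda_partial.partial_fit(X[start:start + 200])

# fit() made several passes with its own learning-rate schedule, the loop above
# made a single pass, so the two topic-word matrices are not the same.
print(np.allclose(lda_fit.components_, lda_partial.components_))  # typically False
```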
