Why do the fit and the partial_fit of the sklearn LatentDirichletAllocation return different results?
The code is not exactly the same; partial_fit uses the total_samples parameter:
"total_samples : int, optional (default=1e6) Total number of documents. Only used in the partial_fit method."
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L184
(partial fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L472
(fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L510
As a side note, partial_fit is a good candidate whenever your dataset is very large. Instead of running into possible memory problems, you fit the model in smaller batches; this is called incremental learning.
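As a minimal sketch of that incremental pattern (the toy count matrix, the number of topics, and chunk_size are made up for illustration and are not from the question), the corpus can be fed to partial_fit one chunk at a time:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy document-term count matrix: 200 documents, 30 terms.
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(200, 30))

lda = LatentDirichletAllocation(
    n_components=5,
    total_samples=X.shape[0],  # tell partial_fit the real corpus size
    random_state=0,
)

# Feed the corpus in chunks of 50 documents, so only one chunk
# has to be held in memory at a time.
chunk_size = 50  # hypothetical value, tune to your memory budget
for start in range(0, X.shape[0], chunk_size):
    lda.partial_fit(X[start:start + chunk_size])

# Topic distribution of the first three documents.
doc_topics = lda.transform(X[:3])
```

In a real setting each chunk would typically come from disk or a generator rather than from slicing an in-memory array.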
So, in your case, take into account that the default value of total_samples is 1000000.0. If you don't change this number and your real number of samples differs from it, you'll get different results from the fit method and from partial_fit. It could also be that you are using mini-batches with partial_fit and not covering all the samples that you provide to the fit method. And even if you do this right, you could still get different results, as stated in the documentation:
- "the incremental learner itself may be unable to cope with new/unseen targets classes. In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter."
- "[...] choosing a proper algorithm is that all of them don’t put the same importance on each example over time [...]"
sklearn documentation: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning
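To check the total_samples explanation on your own data, here is a hedged sketch that compares the two methods under matched settings. It assumes a recent scikit-learn release, where fit has to be told to use learning_method="online" and max_iter=1 to mimic a single partial_fit call; the toy data and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy document-term count matrix: 200 documents, 30 terms.
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(200, 30))
n_samples = X.shape[0]

# Settings shared by all three models so that only total_samples varies.
common = dict(n_components=5, batch_size=50, random_state=0)

# fit: online updates, restricted to a single pass over the data.
lda_fit = LatentDirichletAllocation(
    learning_method="online", max_iter=1, **common
).fit(X)

# partial_fit with total_samples set to the real corpus size.
lda_matched = LatentDirichletAllocation(
    total_samples=n_samples, **common
).partial_fit(X)

# partial_fit with the default total_samples (1e6).
lda_default = LatentDirichletAllocation(**common).partial_fit(X)

matched_equal = np.allclose(lda_fit.components_, lda_matched.components_)
default_equal = np.allclose(lda_fit.components_, lda_default.components_)
print("same topics with matched total_samples:", matched_equal)
print("same topics with default total_samples:", default_equal)
```

If the explanation above holds, the matched model should agree with fit while the default one should not. With the defaults of fit (several passes over the data and, in recent releases, batch rather than online updates), the results would be expected to differ regardless of total_samples.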
augustin-barillec
Updated on September 27, 2022
Comments
- augustin-barillec, over 1 year ago
What is strange is that it seems to be exactly the same code for the fit and for the partial_fit.
You can see the code at the following link:
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L478