Multiprocessing scikit-learn
Solution 1
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
For the multiprocessing: you can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them back to the estimators, and do partial_fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
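A rough, untested sketch of that scheme using SGDClassifier and joblib.Parallel (the synthetic data, the chunking, and the fit_chunk helper are only illustrations, not part of any scikit-learn API):

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.linear_model import SGDClassifier

    def fit_chunk(clf, X_chunk, y_chunk, classes):
        # One SGD pass over this worker's chunk, continuing from clf's current weights.
        clf.partial_fit(X_chunk, y_chunk, classes=classes)
        return clf

    # Toy data, split into one chunk per core.
    X = np.random.randn(10000, 20)
    y = np.random.randint(0, 2, size=10000)
    classes = np.unique(y)
    n_cores = 4
    X_chunks = np.array_split(X, n_cores)
    y_chunks = np.array_split(y, n_cores)

    clfs = [SGDClassifier() for _ in range(n_cores)]

    for round_ in range(5):
        # Each core does partial_fit on its own chunk.
        clfs = Parallel(n_jobs=n_cores)(
            delayed(fit_chunk)(clf, Xc, yc, classes)
            for clf, Xc, yc in zip(clfs, X_chunks, y_chunks)
        )
        # Average the weight vectors ...
        avg_coef = np.mean([clf.coef_ for clf in clfs], axis=0)
        avg_intercept = np.mean([clf.intercept_ for clf in clfs], axis=0)
        # ... and push the averages back so the next round starts from the consensus.
        # (partial_fit resumes from the current coef_/intercept_; details may vary by version.)
        for clf in clfs:
            clf.coef_ = avg_coef.copy()
            clf.intercept_ = avg_intercept.copy()

After the last round, any one of the estimators holds the averaged weights and can be used for prediction.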
How many classes does your data have, by the way? For each class, a separate binary classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core by specifying n_jobs in SGDClassifier.
Solution 2
For linear models (LinearSVC, SGDClassifier, Perceptron, ...) you can chunk your data, train independent models on each chunk and build an aggregate linear model (e.g. SGDClassifier) by sticking in it the average values of coef_ and intercept_ as attributes. The predict method of LinearSVC, SGDClassifier and Perceptron computes the same function (linear prediction using a dot product with an intercept_ threshold and one-vs-all multiclass support), so the specific model class you use for holding the average coefficients is not important.
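For instance, a minimal sketch of that aggregation step (the chunk_models list and the helper name are hypothetical; each element is assumed to be an estimator already fitted on its own chunk):

    import copy
    import numpy as np

    def average_linear_models(chunk_models):
        # Copy one fitted model so classes_ and the other fitted attributes carry over,
        # then overwrite its weights with the element-wise averages across chunks.
        aggregate = copy.deepcopy(chunk_models[0])
        aggregate.coef_ = np.mean([m.coef_ for m in chunk_models], axis=0)
        aggregate.intercept_ = np.mean([m.intercept_ for m in chunk_models], axis=0)
        return aggregate

    # aggregate = average_linear_models(chunk_models)
    # y_pred = aggregate.predict(X_test)  # plain linear prediction with the averaged weights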
However, as previously said, the tricky point is parallelizing the feature extraction, and current scikit-learn (version 0.12) does not provide any way to do this easily.
Edit: scikit-learn 0.13+ now has a hashing vectorizer that is stateless.
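That makes the feature extraction itself embarrassingly parallel; something along these lines should work (untested sketch; the document list, the chunking and the n_jobs value are made up for illustration):

    import scipy.sparse as sp
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer

    # HashingVectorizer keeps no fitted vocabulary, so each worker can vectorize
    # its own chunk of documents independently of the others.
    vectorizer = HashingVectorizer(n_features=2 ** 20)

    def vectorize_chunk(docs_chunk):
        return vectorizer.transform(docs_chunk)

    documents = ["first document text", "second document text", "more text"] * 1000
    n_jobs = 4
    chunk_size = -(-len(documents) // n_jobs)  # ceiling division
    chunks = [documents[i:i + chunk_size] for i in range(0, len(documents), chunk_size)]

    # Vectorize every chunk in a separate worker and stack the sparse blocks.
    X_parts = Parallel(n_jobs=n_jobs)(delayed(vectorize_chunk)(c) for c in chunks)
    X = sp.vstack(X_parts)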
Comments
-
Phyo Arkar Lwin almost 2 years
I got LinearSVC working against a training set and test set using the load_file method; now I am trying to get it working in a multiprocessor environment. How can I get multiprocessing to work on LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet. I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays and scikit-learn data structures.
That way it would be easier to put into multiprocessing.Pool(): split the samples into chunks, train them, and combine the trained sets back later. Would that work?
EDIT: Here is my scenario:
Let's say we have 1 million files in the training sample set. When we want to distribute the TfidfVectorizer processing across several processors, we have to split those samples (in my case there will only be two categories, so say 500,000 samples to train for each). My server has 24 cores with 48 GB, so I want to split each topic into 1000000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?
Thanks.
PS: Please do not close this .
-
Qnan over 11 years
Correct me if I'm wrong, but an SVM usually doesn't take long to make a decision. It might make more sense to perform the decoding for different samples in parallel than to parallelize the decoding for one sample.
-
Phyo Arkar Lwin over 11 years
What if I am going to do that on 21 million documents? Would it take long?
-
Phyo Arkar Lwin over 11 years
I am thinking about different samples too. Is it possible to re-combine the different samples after splitting them for each process?
-
Qnan over 11 years
I don't think I get your question. The samples are independent. Why do you have to re-combine something?
-
Phyo Arkar Lwin over 11 years
Let's say we have 1 million files in the training sample set. When we want to distribute the TfidfVectorizer processing across several processors, we have to split those samples (in my case there will only be two categories, so say 500,000 samples to train for each). My server has 24 cores with 48 GB, so I want to split each topic into 1000000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?
-
Phyo Arkar Lwin over 11 years
So for those splits of samples, I need to recombine them at the end of the multiprocessing to get the training sets back.
-
Qnan over 11 years
I see. You mentioned only testing before, which is why I was surprised. Once the model is trained, the decision can be made for each sample in the testing set independently, so that parallelizes well. Training is a different thing, however: parallelizing SVM training is by no means trivial, and to my knowledge scikit-learn doesn't implement it.
-
ogrisel over 11 years
TfidfVectorizer is not parallelizable because of the central vocabulary. We either need a shared vocabulary (e.g. using a redis server on the cluster) or to implement a HashVectorizer, which does not exist yet.
-
John Thompson over 11 years
What is the status of the hashing vectorizer? I would also like to be able to use joblib.Parallel for vectorization.
-
Phyo Arkar Lwin about 11 years
I see some pull requests on GitHub for 0.14 regarding parallelism. I haven't got a chance to test them yet because we are already in development using 0.13.
-
-
Phyo Arkar Lwin over 11 years
There will be only 3 classes. Is SGDClassifier as accurate as LinearSVC? I will test it out.
-
Phyo Arkar Lwin over 11 years
Thanks man, I will test it out. So TfidfVectorizer is not parallelizable yet, right? Feature extraction takes the most time in our tests.
-
ogrisel over 11 years
Yes, this is a known limitation of scikit-learn. Efficient parallelizable text feature extraction using a hashing vectorizer is high on my personal priority list, but I have got even higher priorities right now :)
-
Phyo Arkar Lwin over 11 years
I see. If I want to contribute, where should I start?
-
ogrisel over 11 years
If you want to contribute a hashing text vectorizer you should first get familiar with the existing CountVectorizer implementation by reading its source code and the source code of related files. Then read the following paper: Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009. Then have a look at this pull request on a hashing transformer, which is closely related but not a hashing text vectorizer.
-
ogrisel over 11 years
Then read the contributors' guide of scikit-learn.
-
Phyo Arkar Lwin over 11 years
Thanks a lot, I will look into it. Actually, I was also looking into CountVectorizer and see some places where multiprocessing could work. I am already thinking about putting Python's standard multiprocessing.Pool() on some loops, such as the _word_ngrams() and _char_wb_ngrams() methods, without even using a hashing vectorizer.
-
ogrisel over 11 years
Please use joblib.Parallel rather than multiprocessing loops directly (see other usage in the scikit-learn source code for example). AFAIK, we did try to parallelize such inner loops but the overhead does not make it interesting at this level.
-
Phyo Arkar Lwin over 11 years
I see, I do not have experience with joblib.Parallel and am not sure about its performance (and stability).
-
Phyo Arkar Lwin over 11 years
I will look into it. FYI, I have a question on extracting features out of each file (I want to show the top 10 terms from each file in the test data set): stackoverflow.com/q/13181409/200044