Multiprocessing scikit-learn


Solution 1

I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.

For the multiprocessing: you can distribute the data chunks across cores, do a partial_fit on each, get the weight vectors back, average them, distribute the averaged weights to the estimators, and do partial_fit again.
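
A rough sketch of that scheme using multiprocessing; the label set, hyperparameters, and chunking below are placeholder assumptions, and it relies on partial_fit continuing from whatever coef_ the estimator currently holds:

```python
import numpy as np
from multiprocessing import Pool
from sklearn.linear_model import SGDClassifier

CLASSES = np.array([0, 1, 2])  # assumed label set; partial_fit needs it up front

def sweep(args):
    """Run one partial_fit pass over a single chunk; return the fitted model."""
    clf, X_chunk, y_chunk = args  # clf arrives pickled with its current weights
    clf.partial_fit(X_chunk, y_chunk, classes=CLASSES)
    return clf

def averaged_sgd(chunks, n_sweeps=5):
    """chunks: list of (X, y) pairs, e.g. one per core."""
    clf = SGDClassifier(loss="hinge", alpha=1e-4)  # illustrative hyperparameters
    for _ in range(n_sweeps):
        with Pool(len(chunks)) as pool:
            models = pool.map(sweep, [(clf, X, y) for X, y in chunks])
        # average the per-chunk weight vectors and push them back into one
        # model, which is then redistributed on the next sweep
        clf = models[0]
        clf.coef_ = np.mean([m.coef_ for m in models], axis=0)
        clf.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return clf
```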

Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.

How many classes does your data have, by the way? For each class, a separate binary classifier will be trained automatically (one-vs-all). If you have nearly as many classes as cores, it might be better, and much easier, to just do one class per core by specifying n_jobs in SGDClassifier.
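
For example, with 3 classes (the random data below is just a stand-in for your vectorized documents):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

X = np.random.rand(1000, 20)       # stand-in for your vectorized documents
y = np.random.randint(0, 3, 1000)  # 3 classes -> 3 one-vs-all fits

# n_jobs parallelizes the per-class one-vs-all fits across cores
clf = SGDClassifier(n_jobs=3).fit(X, y)
```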

Solution 2

For linear models (LinearSVC, SGDClassifier, Perceptron, ...) you can chunk your data, train independent models on each chunk, and build an aggregate linear model (e.g. SGDClassifier) by sticking the average values of coef_ and intercept_ into it as attributes. The predict methods of LinearSVC, SGDClassifier, and Perceptron compute the same function (a linear prediction using a dot product with an intercept_ threshold, and one-vs-all multiclass support), so the specific model class you use to hold the averaged coefficients is not important.
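
A minimal sketch of that trick; it assumes the per-chunk models were trained on the same feature space and saw the same classes, and classes_ has to be copied over as well so that predict can map scores back to labels:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def average_linear_models(models):
    """Aggregate independently trained linear classifiers by averaging
    their weights; the SGDClassifier here is only a container."""
    avg = SGDClassifier()
    avg.coef_ = np.mean([m.coef_ for m in models], axis=0)
    avg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    avg.classes_ = models[0].classes_  # same label set in every chunk model
    return avg
```

Train one LinearSVC/SGDClassifier/Perceptron per chunk (e.g. in a multiprocessing.Pool), then use the returned aggregate's predict as usual.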

However, as said before, the tricky part is parallelizing the feature extraction, and the current scikit-learn (version 0.12) does not provide any way to do this easily.

Edit: scikit-learn 0.13+ now has a HashingVectorizer, which is stateless.
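
Since it is stateless, every worker maps text into the same feature space with no shared vocabulary, so the per-chunk matrices can simply be row-stacked. A sketch with multiprocessing; how you chunk the raw documents is up to you:

```python
from multiprocessing import Pool
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

# stateless: no fit, no vocabulary, safe to use in every worker
vectorizer = HashingVectorizer(n_features=2 ** 20)

def vectorize(doc_chunk):
    """Transform one chunk of raw text documents into a sparse matrix."""
    return vectorizer.transform(doc_chunk)

def parallel_vectorize(doc_chunks):
    """doc_chunks: list of lists of raw documents, e.g. one list per core."""
    with Pool(len(doc_chunks)) as pool:
        matrices = pool.map(vectorize, doc_chunks)
    # every worker hashed into the same feature space, so stacking is safe
    return sp.vstack(matrices)
```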

Comments

  • Phyo Arkar Lwin
    Phyo Arkar Lwin almost 2 years

    I got LinearSVC working against the training set and test set using the load_files method. Now I am trying to get it working in a multiprocessor environment.

    How can I get multiprocessing working for LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.

    I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays and scikit-learn's data structures.

    That way it would be easier to feed into multiprocessing.Pool(): split the samples into chunks, train them, and combine the trained sets back later. Would that work?

    EDIT: Here is my scenario:

    Let's say we have 1 million files in the training sample set. When we want to distribute the TfidfVectorizer processing over several processors, we have to split those samples (in my case there will only be two categories, so let's say 500,000 samples each to train). My server has 24 cores with 48 GB of RAM, so I want to split each topic into 1,000,000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and the decision step. Does that make sense?

    Thanks.

    PS: Please do not close this.

    • Qnan
      Qnan over 11 years
      Correct me if I'm wrong, but an SVM usually doesn't take long to make a decision. It might make more sense to perform the decoding for different samples in parallel than to parallelize the decoding for one sample.
    • Phyo Arkar Lwin
      Phyo Arkar Lwin over 11 years
      What if I am going to do that on 21 million documents? Would that take long?
    • Phyo Arkar Lwin
      Phyo Arkar Lwin over 11 years
      I am thinking about different samples too. Is it possible to re-combine the different samples after splitting them for each process?
    • Qnan
      Qnan over 11 years
      I don't think I get your question. The samples are independent. Why do you have to re-combine something?
    • Phyo Arkar Lwin
      Phyo Arkar Lwin over 11 years
      (Repeats the scenario from the EDIT in the question above: split the 1 million training files into 1,000,000 / 24 chunks, run TfidfVectorizer on each chunk, and do the same for the test set and the SVC steps.)
    • Phyo Arkar Lwin
      Phyo Arkar Lwin over 11 years
      So for those splits of samples, I need to recombine them at the end of the multiprocessing to get the training sets back.
    • Qnan
      Qnan over 11 years
      I see. You mentioned only testing before, which is why I was surprised. Once the model is trained, the decision can be made for each sample in the testing set independently, so that parallelizes well. Training is a different thing, however: parallelizing SVM training is by no means trivial, and to my knowledge scikit-learn doesn't implement it.
    • ogrisel
      ogrisel over 11 years
      TfidfVectorizer is not parallelizable because of its central vocabulary. We would either need a shared vocabulary (e.g. using a Redis server on the cluster) or a HashingVectorizer, which does not exist yet.
    • John Thompson
      John Thompson over 11 years
      What is the status of the hashing vectorizer? I would also like to be able to use joblib.Parallel for vectorization.
    • Phyo Arkar Lwin
      Phyo Arkar Lwin about 11 years
      I see some pull requests on GitHub for 0.14 regarding parallelism. I haven't had a chance to test them yet because we are already in development on 0.13.
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    There will be only 3 classes. Is SGDClassifier as accurate as LinearSVC? I will test it out.
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    Thanks man, I will test it out. So TfidfVectorizer is not parallelizable yet, right? Feature extraction takes the most time in our tests.
  • ogrisel
    ogrisel over 11 years
    Yes, this is a known limitation of scikit-learn. Efficient parallelizable text feature extraction using a hashing vectorizer is high on my personal priority list, but I have even higher priorities right now :)
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    I see. If I want to contribute, where should I start?
  • ogrisel
    ogrisel over 11 years
    If you want to contribute a hashing text vectorizer, you should first get familiar with the existing CountVectorizer implementation by reading its source code and the source code of related files. Then read the following paper: Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, "Feature Hashing for Large Scale Multitask Learning", ICML 2009. Then have a look at the pull request on a hashing transformer, which is closely related but is not a hashing text vectorizer.
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    Thanks a lot, I will look into it. Actually, I was also looking into CountVectorizer and see some places where multiprocessing could work. I am already thinking about putting Python's standard multiprocessing.Pool() on some loops, such as the _word_ngrams() and _char_wb_ngrams() methods, without even using a hashing vectorizer.
  • ogrisel
    ogrisel over 11 years
    Please use joblib.Parallel rather than multiprocessing loops directly (see other usages in the scikit-learn source code for examples; there is a sketch at the end of this thread). AFAIK, we did try to parallelize such inner loops, but the overhead made it not worthwhile at this level.
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    I see. I do not have experience with joblib.Parallel and am not sure about its performance (and stability).
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    I will look into it. FYI, I have a question on extracting features from each file (I want to show the top 10 terms of each file in the test data set): stackoverflow.com/q/13181409/200044
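
As suggested in the comments above, here is a minimal joblib.Parallel version of the chunked vectorization idea; the toy corpus and the two-way split are just for illustration:

```python
import scipy.sparse as sp
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer()  # stateless, so workers need no shared state

docs = ["first document", "second document", "a third one", "and a fourth"]
chunks = [docs[i::2] for i in range(2)]  # split the corpus into 2 chunks

# one transform job per chunk; joblib manages the worker pool and pickling
matrices = Parallel(n_jobs=2)(
    delayed(vectorizer.transform)(chunk) for chunk in chunks
)
X = sp.vstack(matrices)  # identical hash space in every worker, safe to stack
```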