Fastest SVM implementation usable in Python


Solution 1

The most scalable kernel SVM implementation I know of is LaSVM. It's written in C, and hence wrappable from Python if you know Cython, ctypes, or cffi. Alternatively, you can use it from the command line. You can use the utilities in sklearn.datasets to convert data from a NumPy or CSR format into svmlight-formatted files that LaSVM can use as a training / test set.
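For instance, a minimal sketch of that export using sklearn.datasets' dump_svmlight_file (the data shapes and file name here are illustrative):

import numpy as np
from sklearn.datasets import dump_svmlight_file

# Illustrative data at the question's scale: 5000 samples, 650 features
X = np.random.randn(5000, 650)
y = np.random.randint(0, 2, size=5000)

# Write an svmlight / libsvm formatted file that LaSVM can read
dump_svmlight_file(X, y, "train.svmlight")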

Solution 2

Alternatively you can run the grid search on 1000 random samples instead of the full dataset:

>>> from sklearn.model_selection import ShuffleSplit, GridSearchCV
>>> cv = ShuffleSplit(n_splits=3, train_size=0.2, test_size=0.2, random_state=0)
>>> gs = GridSearchCV(clf, param_grid, cv=cv, n_jobs=-1, verbose=2)
>>> gs.fit(X, y)

It's very likely that the optimal parameters for 5000 samples will be very close to the optimal parameters for 1000 samples. So that's a good way to start your coarse grid search.

n_jobs=-1 makes it possible to use all your CPUs to run the individual CV fits in parallel. It uses multiprocessing, so the Python GIL is not an issue.

Solution 3

Firstly, according to scikit-learn's benchmark (here), scikit-learn is already one of the fastest, if not the fastest, SVM packages around. Hence, you might want to consider other ways of speeding up the training.

As suggested by bavaza, you can try to multi-thread the training process. If you are using scikit-learn's GridSearchCV class, you can easily set the n_jobs argument to a value larger than the default of 1 to perform the training in parallel, at the expense of using more memory. You can find the class's documentation here, and an example of how to use it here.
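A minimal sketch of such a parallel grid search (the dataset and grid values below are illustrative, not tuned recommendations):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data at the question's scale: 5000 samples, 650 features
X, y = make_classification(n_samples=5000, n_features=650, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}
gs = GridSearchCV(SVC(kernel="rbf"), param_grid, n_jobs=-1, cv=5, verbose=2)
gs.fit(X, y)
print(gs.best_params_)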

Alternatively, you can take a look at the Shogun Machine Learning Library here.

Shogun is designed for large-scale machine learning, with wrappers for many common SVM packages, and it is implemented in C/C++ with bindings for Python. According to scikit-learn's benchmark above, its speed is comparable to scikit-learn's. On other tasks (other than the one they demonstrated) it might be faster, so it is worth a try.

Lastly, you can try to perform dimensionality reduction, e.g. using PCA or randomized PCA, to reduce the dimension of your feature vectors. That would speed up the training process. The documentation for the respective classes can be found in these two links: PCA, Randomized PCA. You can find examples of how to use them in scikit-learn's examples section.
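A minimal sketch of that pipeline, assuming scikit-learn's current API, where randomized PCA is selected via the svd_solver parameter (n_components=100 is an arbitrary illustrative choice):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=650, random_state=0)

# Project 650 features down to 100 before the (more expensive) SVM fit
clf = make_pipeline(PCA(n_components=100, svd_solver="randomized"),
                    SVC(kernel="rbf"))
clf.fit(X, y)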

Solution 4

If you are interested in using only the RBF kernel (or any other quadratic kernel, for that matter), then I suggest using LIBSVM in MATLAB or Octave. I train a model of 7000 observations and 500 features in about 6 seconds.

The trick is to use the precomputed kernels that LIBSVM provides, and some matrix algebra to compute the kernel in one step instead of looping over the data twice. The kernel takes about two seconds to build, as opposed to a lot more using LIBSVM's own RBF kernel. I presume you would be able to do the same in Python using NumPy, but I am not sure, as I have not tried it.
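For what it's worth, here is a NumPy sketch of that one-step matrix-algebra computation (the gamma value and array sizes are illustrative):

import numpy as np

def rbf_kernel_matrix(X, gamma):
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * x_i . x_j, all at once
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clamp tiny negatives

X = np.random.randn(7000, 500)  # the sizes mentioned in the answer
K = rbf_kernel_matrix(X, gamma=1.0 / X.shape[1])

The resulting matrix K can then be fed to LIBSVM's precomputed-kernel mode, or to scikit-learn via SVC(kernel="precomputed").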

Solution 5

Without going too much into comparing SVM libraries, I think the task you are describing (cross-validation) can benefit from real multi-threading (i.e. running on several CPUs in parallel). If you are using CPython, it does not take advantage of your (probably) multi-core machine, due to the GIL.

You can try other implementations of Python which don't have this limitation. See PyPy, or IronPython if you are willing to go the .NET route.
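That said, as the comments below point out, CPython's standard multiprocessing module also sidesteps the GIL by using processes instead of threads. A hedged sketch of running cross-validation folds in parallel that way (the data, fold count, and model settings are illustrative):

from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic data at the question's scale
X, y = make_classification(n_samples=5000, n_features=650, random_state=0)

def fit_and_score(split):
    train_idx, test_idx = split
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    return clf.score(X[test_idx], y[test_idx])

if __name__ == "__main__":
    splits = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
    with ProcessPoolExecutor() as pool:  # one worker per CPU core by default
        print(list(pool.map(fit_and_score, splits)))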


Comments

  • tomas
    tomas almost 4 years

    I'm building some predictive models in Python and have been using scikit-learn's SVM implementation. It's been really great, easy to use, and relatively fast.

    Unfortunately, I'm beginning to become constrained by my runtime. I run an RBF SVM on a full dataset of about 4000-5000 samples with 650 features. Each run takes about a minute. But with 5-fold cross validation + grid search (using a coarse-to-fine search), it's getting a bit infeasible for my task at hand. So generally, do people have any recommendations for the fastest SVM implementation that can be used in Python? That, or any ways to speed up my modeling?

    I've heard of LIBSVM's GPU implementation, which seems like it could work. I don't know of any other GPU SVM implementations usable in Python, but I would definitely be open to others. Also, does using the GPU significantly reduce runtime?

    I've also heard that there are ways of approximating the RBF SVM by using a linear SVM + feature map in scikit-learn. Not sure what people think about this approach (a sketch of the idea follows this comment). Anyone using this approach, is it a significant improvement in runtime?

    All ideas for increasing the speed of the program are most welcome.
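    (For reference, a hedged sketch of that linear SVM + feature map idea, using scikit-learn's kernel approximation module; the gamma and n_components values are illustrative, not recommendations:)

    from sklearn.datasets import make_classification
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=5000, n_features=650, random_state=0)

    # Approximate the RBF kernel with an explicit random feature map, then
    # train a (much faster) linear SVM on the mapped features
    clf = make_pipeline(RBFSampler(gamma=0.01, n_components=500, random_state=0),
                        LinearSVC())
    clf.fit(X, y)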

  • tomas
    tomas over 12 years
    Thanks bavaza, I'll take a look into it. Assuming I do take advantage of my multi-core computer, any other suggestions for speeding up my program? I was going to figure out a way to cross-validate across multiple threads anyway. However, I think I still need a speedup.
  • tomas
    tomas over 12 years
    Thanks ogrisel. I'll take a look at this. Definitely looks interesting. Sklearn can export into SVM light format? That will definitely be useful. In response to your prior answer, unfortunately I'm dealing with time series, so random sampling + splitting into train/test becomes quite a bit more complicated. Not sure subsampling to train my model will be all that straightforward. Thanks!
  • tomas
    tomas over 12 years
    Sorry, quick addendum ogrisel: do you know which utility function in sklearn can export in SVM light format?
  • ogrisel
    ogrisel over 12 years
    Indeed it's missing from the doc but it's there: github.com/scikit-learn/scikit-learn/blob/master/sklearn/…
  • ogrisel
    ogrisel over 12 years
    @thomas If your samples are not (at least loosely) i.i.d., there is a good chance that an SVM with a generic kernel such as RBF will not yield good results. If you have time-series data (with time dependencies between consecutive measurements) you should either extract higher-level features (e.g. convolutions over sliding windows or STFT) or precompute a time-series-dedicated kernel. A toy sketch of the sliding-window idea follows.
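    (A hypothetical illustration of the sliding-window idea, not ogrisel's actual code: summary statistics over sliding windows with NumPy, where the window size, step, and statistics are arbitrary choices.)

    import numpy as np

    def window_features(series, window=50, step=10):
        # One row of higher-level features per window: mean, std, min, max
        starts = range(0, len(series) - window + 1, step)
        windows = np.stack([series[s:s + window] for s in starts])
        return np.column_stack([windows.mean(axis=1), windows.std(axis=1),
                                windows.min(axis=1), windows.max(axis=1)])

    X = window_features(np.random.randn(5000))  # (n_windows, 4) feature matrix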
  • tomas
    tomas over 12 years
    Hmm... interesting. Do you mind expanding on what you said? I've heard of dependent data causing issues for cross-validation procedures, but not specifically for an RBF SVM. What issues can arise? And any references or pointers on what is meant by extracting higher-level features? I don't know if the comment section is the best place, but I would love to hear more about this. Thanks.
  • ogrisel
    ogrisel over 12 years
    If the inter-sample time dependencies prevent you from doing arbitrary sub-sampling & cross-validation, I don't see how the SVM RBF model will be able to learn something general: the model makes its predictions for each individual sample one at a time, independently of past predictions (no memory), hence the input features should encode some kind of high-level "context" if you want it to generalize well enough to make interesting predictions on previously unseen data.
  • Phyo Arkar Lwin
    Phyo Arkar Lwin over 11 years
    @bavaza, I have been running Python on multiple cores for many years, and it works very well. Please research the multiprocessing lib of standard CPython.
  • bavaza
    bavaza over 11 years
    @V3ss0n, thanks. Looks like a nice lib. As it uses processes and not threads, are you familiar with any context-switching penalties (e.g. when using a large worker pool)?
  • mrgloom
    mrgloom about 11 years
    Generally speaking, LibSVM is a good mature lib, but I think it's not the fastest, and 7000 x 500 is a very small problem to test.
  • halflings
    halflings over 9 years
    PyPy also has a GIL (even if they have an experimental project to implement an alternative memory management strategy); as some have said, the easiest way to avoid the GIL is still multiprocessing instead of threading. I'm really not sure that using IronPython will give better performance (with all the .NET overhead).