How to apply standardization to SVMs in scikit-learn?
Solution 1
Neither. Calling
scaler.transform(X_train)
on its own doesn't have any effect: the transform operation is not in-place, it returns a new array instead of modifying its argument.
You have to do
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
or
X_train = scaler.fit(X_train).transform(X_train)
You always need to apply the same preprocessing to both the training and the test data. And yes, standardization is always good if it reflects your beliefs about the data. In particular for kernel SVMs it is often crucial.
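As a quick sketch of the point above (the toy arrays here are invented for illustration): the scaler's mean and standard deviation are learned from the training data only, that same transformation is then reused on the test data, and neither input array is modified in place.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with two features on very different scales (illustrative values).
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data, then transform it
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std on the test data

# transform() returns new arrays; the originals are untouched
print(X_train)         # original values, unchanged
print(X_train_scaled)  # zero mean, unit variance per column
```

Note that the test data will generally not have exactly zero mean and unit variance after scaling, because it is standardized with the training statistics, not its own.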
Solution 2
Why not use a Pipeline
to chain (or combine) transformers and estimators in one go? It saves you the hassle of separately fitting and transforming your data and then passing it to the estimator, and it saves some space, too.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
pipe_lrSVC = Pipeline([('scaler', StandardScaler()), ('clf', LinearSVC())])
pipe_lrSVC.fit(X_train, y_train)
y_pred = pipe_lrSVC.predict(X_test)
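A further benefit of the pipeline, sketched below with a synthetic dataset (the dataset and CV settings are illustrative, not from the original question): when you cross-validate the pipeline as a whole, the scaler is refit on each training fold, so no statistics from the held-out fold leak into the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data just for demonstration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('clf', LinearSVC())])

# Inside each CV split, StandardScaler is fit on the training fold only
# and applied to the validation fold, mirroring the train/test discipline above.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Doing the scaling outside the cross-validation loop instead would fit the scaler on all of X, which subtly leaks validation-fold information into every fold.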
pemistahl
Computational linguist and software engineer, currently interested in Kotlin and Rust programming
Updated on September 04, 2020

Comments
-
pemistahl over 3 years:
I'm using the current stable version 0.13 of scikit-learn. I'm applying a linear support vector classifier to some data using the class
sklearn.svm.LinearSVC
. In the chapter about preprocessing in scikit-learn's documentation, I've read the following:
Many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
Question 1: Is standardization useful for SVMs in general, also for those with a linear kernel function as in my case?
Question 2: As far as I understand, I have to compute the mean and standard deviation on the training data and apply this same transformation to the test data using the class
sklearn.preprocessing.StandardScaler
. However, what I don't understand is whether I have to transform the training data as well or just the test data prior to feeding it to the SVM classifier. That is, do I have to do this:
scaler = StandardScaler()
scaler.fit(X_train)  # only compute mean and std here
X_test = scaler.transform(X_test)  # perform standardization by centering and scaling
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)
Or do I have to do this:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # compute mean, std and transform training data as well
X_test = scaler.transform(X_test)  # same as above
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.predict(X_test)
In short, do I have to use scaler.fit(X_train) or scaler.fit_transform(X_train) on the training data in order to get reasonable results with LinearSVC?
-
pemistahl over 11 years: Sure, I'm aware of this. I was just too lazy to post it (shame on me). The key point is whether to use fit() or fit_transform() on X_train.
-
Andreas Mueller over 11 years: Added a comment. To rephrase your question again: it is not about fit or fit_transform, but whether to transform both the test and the training data. The answer is: definitely both. If you transform only one, how could you expect to learn anything? They would no longer be from the same distribution.
-
pemistahl over 11 years: Alright, that's what I wanted to know. I'm pretty new to SVMs and was a bit confused. Anyway, thanks for your quick reaction. :)
-
john doe almost 8 years: @AndreasMueller Do I need to scale my features if I am using gradient boosting classification?
-
Andreas Mueller almost 8 years: Not if you are using trees as weak learners. All tree-based models are agnostic to scaling.
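The comment above can be checked with a small sketch (synthetic data, invented for illustration): tree splits compare feature values against thresholds, and standardization is a strictly increasing per-feature transformation, so it preserves the ordering of values and leaves the learned tree structure unchanged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data just for demonstration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and random_state; fit once on raw and once on scaled data.
raw_pred = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
scaled_pred = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)

# Scaling is monotone per feature, so the split structure and predictions agree.
print(np.array_equal(raw_pred, scaled_pred))
```

The same invariance argument applies to random forests and gradient boosting with tree base learners; it does not apply to distance- or margin-based models such as SVMs or k-NN.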
-
Agostino over 7 years: Are you sure about calling transform on the test set? The example in this doc page uses fit on the test set instead of transform.
-
Andreas Mueller over 7 years: @Agostino Which line? Doesn't look like that to me. If it does, it's a bug and we need to fix the example.
-
Agostino over 7 years: You are right. No idea if it was edited or if I saw it somewhere else. Thanks.