Multiple classification models in a scikit pipeline python
To summarize, here is an easy way to optimize over any classifier and, for each classifier, over any settings of its parameters.
Create a switcher class that works for any estimator
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier


class ClfSwitcher(BaseEstimator):

    def __init__(
        self,
        estimator=SGDClassifier(),
    ):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        # Forward any extra fit arguments to the wrapped classifier
        self.estimator.fit(X, y, **kwargs)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter, and you can optimize any parameter of whichever estimator you pass in, as follows:
Perform hyper-parameter optimization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # 'log_loss' was spelled 'log' before scikit-learn 1.1
        'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
# Assumes train_data holds raw text documents and train_labels the matching labels
gscv.fit(train_data, train_labels)
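When the parameters are given as a list of dicts, each dict is expanded into its own grid and the candidates from all grids are searched together. A rough way to see how much work that is, sketched here with ParameterGrid from sklearn.model_selection on the same grid as above (the variable names n_sgd, n_nb and n_fits are only illustrative):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB

# Same two grids as in the answer (with the newer 'log_loss' spelling).
parameters = [
    {
        'clf__estimator': [SGDClassifier()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

# ParameterGrid enumerates every combination within each dict,
# then concatenates the dicts; it never mixes keys across dicts.
grid = ParameterGrid(parameters)
n_candidates = len(grid)
n_sgd = sum(isinstance(p['clf__estimator'], SGDClassifier) for p in grid)
n_nb = n_candidates - n_sgd

cv = 5
n_fits = n_candidates * cv  # one fit per candidate per fold, before the final refit

print(f"SGDClassifier candidates: {n_sgd}")
print(f"MultinomialNB candidates: {n_nb}")
print(f"cross-validation fits:    {n_fits}")
```

So the search does not run "one round per estimator"; every candidate, whichever estimator it names, gets its own 5-fold evaluation.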
How to interpret clf__estimator__loss
clf__estimator__loss is interpreted as the loss parameter of whatever estimator is set to. In the first grid above, estimator = SGDClassifier(), and estimator is itself a parameter of clf, which is a ClfSwitcher object.
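The name resolution can be checked directly with get_params and set_params. Below is a minimal sketch, assuming a trimmed copy of ClfSwitcher (repeated only so the snippet runs on its own) and passing the estimator explicitly; the variable names inner, swapped, has_loss_before and has_loss_after are illustrative.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Trimmed copy of the switcher from the answer."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)


pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher(estimator=SGDClassifier())),
])

# Each "__" descends one level:
# step "clf" -> its "estimator" parameter -> that estimator's "loss" parameter.
has_loss_before = 'clf__estimator__loss' in pipeline.get_params()

pipeline.set_params(clf__estimator__loss='modified_huber')
inner = pipeline.named_steps['clf'].estimator
print(type(inner).__name__, inner.loss)

# Swapping the estimator changes which nested names exist.
pipeline.set_params(clf__estimator=MultinomialNB(), clf__estimator__alpha=0.1)
swapped = pipeline.named_steps['clf'].estimator
has_loss_after = 'clf__estimator__loss' in pipeline.get_params()
print(type(swapped).__name__, swapped.alpha, has_loss_after)
```

This is also why each dict in the grid lists only the parameters that its own estimator understands.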
denbuttigieg
Updated on July 26, 2022
Comments
-
denbuttigieg almost 2 years
I am solving a binary classification problem over some text documents using Python and the scikit-learn library, and I wish to try different models to compare and contrast results, mainly a Naive Bayes classifier, an SVM with K-fold CV, and CV=5. I am finding it difficult to combine all of the methods into one pipeline, given that the latter two models use GridSearchCV(). I cannot have multiple Pipelines running during a single implementation due to concurrency issues, hence I need to implement all the different models using one pipeline.
This is what I have till now:
# pipeline for naive bayes
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])

param_svm = [
    {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm_skf = GridSearchCV(
    svm_pipeline,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    cv=StratifiedKFold(train_data['gender'], n_folds=5),  # using StratifiedKFold CV with 5 folds
)

svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])
EDIT 1: The second pipeline is the only pipeline using GridSearchCV(), and it never seems to be executed.
EDIT 2: Added more code to show GridSearchCV() use.
-
slaw over 5 years
I am familiar with GridSearchCV in the traditional case with one estimator. Can you explain what is actually happening in the GridSearchCV when you provide parameters with two estimators? Does it perform 5-fold CV twice (i.e., one round for the SGDClassifier and one round for MultinomialNB) and then repeat it for each set of grid parameters?
-
slaw over 5 years
Do you know if it is possible to provide multiple datasets as a parameter so that I can fit different estimators with different datasets?
-
cgnorthcutt over 5 years
Sure..
for dataset in datasets: gscv.fit(...)
-
slaw over 5 years
I don't think that would work, as the multiple calls to gscv.fit would clobber the fit from the last dataset. I want each of the calls to fit with different datasets to be appended.
-
cgnorthcutt over 5 years
Clobber? Just initialize each time.
gscv = GridSearchCV(); gscv.fit()
There isn't much more to this.
-
GSA about 2 years
@cgnorthcutt how does one extract the scores for, say, each estimator (SGDClassifier() or MultinomialNB()), given that it's not using named_steps?
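One possible way to get per-estimator scores is to group the entries of cv_results_ by the type of the estimator stored under the clf__estimator key. The sketch below is not from the thread: the toy corpus, the reduced grid, and the names scores_by_estimator and best_by_estimator are all made up for illustration, and ClfSwitcher is a trimmed copy so the snippet runs on its own.

```python
from collections import defaultdict

from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):
    """Trimmed copy of the switcher from the answer."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


# Hypothetical toy corpus with two balanced classes.
docs = [
    "great movie loved it", "terrible film hated it",
    "wonderful acting great plot", "awful plot terrible acting",
    "loved the wonderful cast", "hated the awful script",
    "great fun loved every minute", "terrible waste hated every minute",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher(estimator=SGDClassifier())),
])

# Deliberately small grid so the sketch runs quickly.
parameters = [
    {
        'clf__estimator': [SGDClassifier(random_state=0)],
        'clf__estimator__loss': ['hinge', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'clf__estimator__alpha': [0.1, 1.0],
    },
]

# Explicit stratified folds so every training fold contains both classes.
gscv = GridSearchCV(pipeline, parameters, cv=StratifiedKFold(n_splits=2))
gscv.fit(docs, labels)

# cv_results_['params'] is a list of dicts, one per candidate,
# aligned with cv_results_['mean_test_score'].
scores_by_estimator = defaultdict(list)
for params, mean_score in zip(gscv.cv_results_['params'],
                              gscv.cv_results_['mean_test_score']):
    name = type(params['clf__estimator']).__name__
    scores_by_estimator[name].append(float(mean_score))

best_by_estimator = {name: max(s) for name, s in scores_by_estimator.items()}
for name, score in best_by_estimator.items():
    print(f"{name}: best mean CV score = {score:.3f}")
```

The overall winner is still available as gscv.best_params_ and gscv.best_score_; the grouping only splits the same results by estimator type.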