Sklearn: How to Save a Model Created From a Pipeline and GridSearchCV Using Joblib or Pickle?

import joblib

# grid.best_estimator_ is the pipeline refit with the best parameters found
joblib.dump(grid.best_estimator_, 'filename.pkl')

If you want to dump your object into a single file, use:

joblib.dump(grid.best_estimator_, 'filename.pkl', compress=1)
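To reuse the saved model later, load it back with joblib.load. A minimal sketch, assuming 'filename.pkl' is the file dumped above and new_docs is a hypothetical iterable of raw text like the training input:

import joblib

model = joblib.load('filename.pkl')    # restores the fitted pipeline
predictions = model.predict(new_docs)  # new_docs: hypothetical new keyword strings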

Comments

  • Jarad almost 2 years

    After identifying the best parameters using a pipeline and GridSearchCV, how do I pickle/joblib this process to re-use later? I see how to do this when it's a single classifier...

    from sklearn.externals import joblib  # deprecated; in recent sklearn versions use "import joblib"
    joblib.dump(clf, 'filename.pkl')
    

    But how do I save this overall pipeline with the best parameters after performing and completing a gridsearch?

    I tried:

    • joblib.dump(grid, 'output.pkl') - But that dumped every gridsearch attempt (many files)
    • joblib.dump(pipeline, 'output.pkl') - But I don't think that contains the best parameters

    # Imports needed to make this snippet runnable (modern sklearn paths):
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X_train = df['Keyword']   # df: the asker's DataFrame of keywords and ad groups
    y_train = df['Ad Group']

    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier()),
    ])
    
    parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
                  'tfidf__max_features': [10, 50, 100, 250, 500, 1000, None],
                  'tfidf__stop_words': ('english', None),
                  'tfidf__smooth_idf': (True, False),
                  'tfidf__norm': ('l1', 'l2', None),
                  }
    
    grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
    grid.fit(X_train, y_train)
    
    # This was the best combination of tuning parameters discovered:
    ##best_params = {'tfidf__max_features': None, 'tfidf__use_idf': False,
    ##               'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2),
    ##               'tfidf__max_df': 1.0, 'tfidf__stop_words': 'english',
    ##               'tfidf__norm': 'l2'}
    
  • Odisseo about 5 years
    As a best practice, once the best model has been selected, one should retrain it on the entire dataset. In order to do so, should one retrain the same pipeline object on the entire dataset (thus applying the same data processing) and then deploy that very object? Or should one recreate a new model?
  • brian_ds over 4 years
    @Odisseo - My opinion is that you retrain a new model from scratch. You can still use a pipeline, but you swap the grid search for your final classifier (say, a random forest), add that classifier to the pipeline, retrain on all the data, and save the end model. The end result is that your entire dataset was trained inside the full pipeline you want. This may lead to slightly different preprocessing, for instance, but it should be more robust. In practice, this means you call pipeline.fit() and save the pipeline (see the sketch after these comments).
  • Federico Dorato about 4 years
    @Odisseo I'm a little late but... GridSearchCV automatically retrains the model on the entire dataset unless you explicitly ask it not to. So, when you train the GridSearchCV model, the model you use for predicting (in other words, best_estimator_) has already been retrained on the whole dataset (see the second sketch below).
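A minimal sketch of the retrain-and-save approach brian_ds describes, assuming grid and pipeline are the objects from the question and X_all, y_all are illustrative names for the full dataset:

import joblib

# Copy the winning hyperparameters into the (unfitted) pipeline,
# refit it on all available data, then persist the refit pipeline.
pipeline.set_params(**grid.best_params_)
pipeline.fit(X_all, y_all)
joblib.dump(pipeline, 'final_model.pkl', compress=1)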
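And a sketch of Federico Dorato's point: GridSearchCV's refit parameter defaults to True, so after fitting, best_estimator_ has already been refit on all of the data passed to fit and can be dumped directly (variable names as in the question):

import joblib
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1, refit=True)  # refit=True is the default
grid.fit(X_train, y_train)

# best_estimator_ is the pipeline refit on all of X_train with the best parameters found
joblib.dump(grid.best_estimator_, 'filename.pkl', compress=1)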