Scikit-Learn's Pipeline: A sparse matrix was passed, but dense data is required


Solution 1

Unfortunately those two are incompatible: a CountVectorizer produces a sparse matrix, and the RandomForestClassifier requires a dense matrix. It is possible to convert with X.todense(), but doing so will substantially increase your memory footprint.

Below is sample code, based on http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, that lets you call .todense() in a pipeline stage.

from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    """Converts the sparse output of the previous pipeline step to a dense matrix."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()

Once you have your DenseTransformer, you can add it as a pipeline step.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('classifier', RandomForestClassifier())
])
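
For example, fitting and predicting with that pipeline end to end (a minimal sketch with made-up texts and labels, just to illustrate usage):

texts = ["free money now", "meeting at noon", "win a big prize", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]  # hypothetical toy data

pipeline.fit(texts, labels)
print(pipeline.predict(["free prize money"]))  # e.g. ['spam']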

Another option would be to use a classifier meant for sparse data like LinearSVC.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LinearSVC())])

Solution 2

The most terse solution would be to use a FunctionTransformer to convert to dense: this automatically implements the fit, transform and fit_transform methods, as in David's answer. Additionally, if I don't need special names for my pipeline steps, I like to use the sklearn.pipeline.make_pipeline convenience function, which gives a more minimalist language for describing the model:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(lambda x: x.todense(), accept_sparse=True),
    RandomForestClassifier()
)
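
One caveat, also raised in the comments below: a pipeline built around a lambda cannot be pickled (for example by GridSearchCV or joblib). A pickle-safe variant is sketched here, replacing the lambda with a module-level function (to_dense is my name for it, not part of any API):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

def to_dense(x):
    # A module-level function can be pickled; a lambda cannot.
    return x.todense()

pipeline = make_pipeline(
    CountVectorizer(),
    FunctionTransformer(to_dense, accept_sparse=True),
    RandomForestClassifier()
)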

Solution 3

Random forests in scikit-learn 0.16-dev now accept sparse data, so no conversion step is needed on recent versions.
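
In other words, on scikit-learn 0.16 or later the original two-step pipeline should work as-is (a minimal sketch, assuming a sufficiently recent scikit-learn):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# The forest consumes the sparse matrix from CountVectorizer directly.
pipeline = Pipeline([('vectorizer', CountVectorizer()),
                     ('classifier', RandomForestClassifier())])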

Solution 4

You can convert a pandas Series to an array using the .values attribute.

pipeline.fit(df[0].values, df[1].values)

However, I think the issue here happens because CountVectorizer() returns a sparse matrix by default, which cannot be piped directly to the RF classifier. CountVectorizer() does have a dtype parameter to specify the type of array returned. That said, you usually need some sort of dimensionality reduction to use random forests for text classification, because bag-of-words feature vectors are very long; a sketch of that idea follows below.
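
As an illustration of that last point, here is a minimal sketch using TruncatedSVD (my choice for this example, not something the answer prescribes); it accepts sparse input and outputs a dense array, so it doubles as the densifying step:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    # Shrink the long bag-of-words vectors; n_components must be
    # smaller than the vocabulary size.
    ('svd', TruncatedSVD(n_components=100)),
    ('classifier', RandomForestClassifier())
])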


Comments

  • Ada Stra
    Ada Stra almost 2 years

    I'm finding it difficult to understand how to fix a Pipeline I created (read: largely pasted from a tutorial). It's python 3.4.2:

    df = pd.DataFrame.from_records(train)
    
    test = [blah1, blah2, blah3]
    
    pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])
    
    pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1]))
    predicted = pipeline.predict(test)
    

    When I run it, I get:

    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
    

    This is for the line pipeline.fit(numpy.asarray(df[0]), numpy.asarray(df[1])).

    I've experimented a lot with solutions through numpy, scipy, and so forth, but I still don't know how to fix it. And yes, similar questions have come up before, but not inside a pipeline. Where is it that I have to apply toarray or todense?

  • Ada Stra
    Ada Stra over 9 years
    I see, thanks a lot, makes sense now. I tried upvoting you but I don't have enough reputation?
  • Ada Stra
    Ada Stra over 9 years
    Thanks a lot! I am experimenting with different classifiers, in part to learn, and in part to find what works best. Truth be told, for my case I get by far best results with multinomial NB. I'll experiment with your code, thanks so much for the exhaustive answer.
  • David Maust
    David Maust over 9 years
    Sounds fun. RandomForest is good for dense numeric data. I've found it doesn't scale that well for sparse text features. If you do want to try it on text, you might try adding a feature selection stage first. That can sometimes work well. My favorites for text have been LinearSVC and SGDClassifier using either loss='modified_huber' or loss='log'.
  • stackit
    stackit over 8 years
    What parameters should I use for a classifier-based POS tagger application using SGD?
  • dactylroot
    dactylroot over 7 years
    I just tried this and saw the accept_sparse parameter of FunctionTransformer. You need to set it to True.
  • Jarad
    Jarad over 6 years
    For those of you that use @maxymoo's solution as much as I do: FunctionTransformer can be imported with from sklearn.preprocessing import FunctionTransformer.
  • Guido
    Guido over 5 years
    I get an error when adding the FunctionTransformer to the pipeline: AttributeError: Can't pickle local object 'main_large.<locals>.<lambda>'. Any hints on how to fix it?
  • maxymoo
    maxymoo over 5 years
    @guido use dill instead of pickle
  • Dror
    Dror over 5 years
    @Guido I am guessing you're trying to use the pipeline inside some cross validation / grid search. Under the hood, the pipeline is pickled and the problem is that lambda functions cannot be pickled. Therefore, you have to extract the lambda functionality into a regular function def to_dense(x): and use it instead of the lambda.
  • Joselo
    Joselo almost 2 years
    This worked for me! I was using Naive Bayes in the pipeline which also requires a dense matrix.