AttributeError: lower not found; using a Pipeline with a CountVectorizer in scikit-learn


Solution 1

It's because your dataset is in the wrong format: CountVectorizer's fit method (or the pipeline's, it doesn't matter which) expects "an iterable which yields either str, unicode or file objects", not an iterable of other iterables that contain the texts, as in your code. A plain list is such an iterable, so you should pass a flat list whose members are strings, not other lists.

i.e. your dataset should look like:

X_train = ['this is an dummy example',
           'in reality this line is very long',
           ...
           'here is a last text in the training set'
          ]
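
If your data is currently a list of one-element lists, as in the question, here is a minimal sketch of flattening it, assuming every inner list holds exactly one string:

# pull the single string out of each one-element inner list
X_train = [doc[0] for doc in X_train]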

Have a look at this very useful scikit-learn example: Sample pipeline for text feature extraction and evaluation

Solution 2

You can pass data like this:

from sklearn import metrics
text_clf.fit(list(X_train), list(y_train))
predicted = text_clf.predict(list(X_test))
print(metrics.classification_report(list(y_test), predicted))
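
For completeness, here is a minimal end-to-end sketch that combines both solutions, using the same pipeline as in the question on a flat list of strings; the documents and target values below are made up purely for illustration:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# toy data for illustration only: a flat list of strings plus numeric targets
X_train = ['this is a dummy example',
           'in reality this line is very long',
           'another short training document',
           'here is a last text in the training set']
y_train = [1.0, 5.0, 2.0, 3.0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (1e-5, 1e-6),
}

# cv=2 keeps the folds valid on this tiny toy corpus
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1, cv=2)
grid_search.fit(X_train, y_train)  # no "lower not found" once the input is flat strings
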
Comments

  • tumultous_rooster almost 2 years

    I have a corpus as such:

    X_train = [ ['this is an dummy example'],
                ['in reality this line is very long'],
                ...
                ['here is a last text in the training set']
              ]
    

    and some labels:

    y_train = [1, 5, ... , 3]
    

    I would like to use Pipeline and GridSearch as follows:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDRegressor
    from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

    pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('reg', SGDRegressor())
    ])

    parameters = {
        'vect__max_df': (0.5, 0.75, 1.0),
        'tfidf__use_idf': (True, False),
        'reg__alpha': (0.00001, 0.000001),
    }

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

    grid_search.fit(X_train, y_train)
    

    When I run this, I get an error saying AttributeError: lower not found.

    I searched and found a question about this error here, which led me to believe that there was a problem with my text not being tokenized (which sounded like it hit the nail on the head, since I was using a list of lists as input data, where each inner list contained one single unbroken string).

    I cooked up a quick and dirty tokenizer to test this theory:

    def my_tokenizer(X):
        newlist = []
        for alist in X:
            newlist.append(alist[0].split(' '))
        return newlist
    

    which does what it is supposed to, but when I use it in the arguments to the CountVectorizer:

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=my_tokenizer)),
    

    ...I still get the same error as if nothing happened.

    I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline. Which is strange...I didn't think you could use the TfidfTransformer() without first having a data structure to transform...in this case the matrix of counts.

    Why do I keep getting this error? Actually, it would be nice to know what this error means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace). Am I misusing the Pipeline...or is the problem really an issue with the arguments to the CountVectorizer alone?

    Any advice would be greatly appreciated.

  • tumultous_rooster over 8 years
    Coincidentally, I based my code off this example. Since the example pulls its data from sklearn.datasets.fetch_20newsgroups, it is unclear what format that data is in (a list? a matrix?). The documentation isn't very helpful on this detail either.
  • Ibraim Ganiev over 8 years
    @MattO'Brien Yep, I can only recommend using the IPython console or Jupyter notebooks (or simply the standard Python interpreter/debugger, if you don't want to install additional software) to see intermediate results; it helps a lot in understanding such small details.
  • tumultous_rooster over 8 years
    I do use the IPython notebook, but I merely read the example and modified it for my own purposes. I didn't actually execute the original example, assuming that the input was a list of lists. I should have done my due diligence.