AttributeError: lower not found; using a Pipeline with a CountVectorizer in scikit-learn
Solution 1
This happens because your dataset is in the wrong format. You should pass "An iterable which yields either str, unicode or file objects" into CountVectorizer's fit function (or into the pipeline, it doesn't matter), not an iterable over other iterables that contain the texts, as in your code. In your case the list is the iterable, and you should pass a flat list whose members are strings, not other lists.
That is, your dataset should look like:
X_train = ['this is an dummy example',
'in reality this line is very long',
...
'here is a last text in the training set'
]
The scikit-learn example "Sample pipeline for text feature extraction and evaluation" is very useful to look at here.
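A minimal sketch of the point above, assuming the nested corpus from the question with its elided rows left out. The names nested_docs and flat_docs are made up for illustration. By default CountVectorizer lowercases each document, so a "document" that is itself a list fails with an AttributeError; unwrapping each inner list into a plain string avoids that.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus in the problematic shape: a list of one-element lists.
nested_docs = [
    ['this is an dummy example'],
    ['in reality this line is very long'],
    ['here is a last text in the training set'],
]

# Passing the nested list fails: each "document" is a list, which has no .lower().
try:
    CountVectorizer().fit(nested_docs)
    nested_failed = False
except AttributeError:
    nested_failed = True

# Unwrap each inner list so that every document is a plain string.
flat_docs = [doc[0] for doc in nested_docs]

counts = CountVectorizer().fit_transform(flat_docs)
print(nested_failed)    # whether the nested input raised AttributeError
print(counts.shape[0])  # number of rows, one per document
```

The same flat list can be passed to a Pipeline or GridSearchCV whose first step is a CountVectorizer.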
Solution 2
You can pass data like this:
from sklearn import metrics
text_clf.fit(list(X_train), list(y_train))
predicted = text_clf.predict(list(X_test))
print(metrics.classification_report(list(y_test), predicted))
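For context, here is a self-contained sketch of this approach. text_clf is not defined in the answer, so the pipeline below (with MultinomialNB as the classifier) and the toy data are assumptions for illustration only. The data starts out as tuples to stand in for any non-list container that yields one string per sample. Note that list(...) only changes the outer container; it does not unwrap documents that are themselves lists.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Assumed pipeline: the answer does not show how text_clf was built.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Hypothetical toy data, one plain string per sample.
X_train = ('good movie', 'great film', 'bad movie', 'awful film')
y_train = (1, 1, 0, 0)
X_test = ('great movie', 'awful movie')
y_test = (1, 0)

# Convert each container to a plain list before fitting and predicting.
text_clf.fit(list(X_train), list(y_train))
predicted = text_clf.predict(list(X_test))
print(metrics.classification_report(list(y_test), predicted))
```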
Comments
-
tumultous_rooster almost 2 years
I have a corpus as such:
X_train = [
    ['this is an dummy example'],
    ['in reality this line is very long'],
    ...
    ['here is a last text in the training set']
]
and some labels:
y_train = [1, 5, ... , 3]
I would like to use Pipeline and GridSearch as follows:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('reg', SGDRegressor())
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'tfidf__use_idf': (True, False),
    'reg__alpha': (0.00001, 0.000001),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
grid_search.fit(X_train, y_train)
When I run this, I get an error saying:
AttributeError: lower not found
I searched and found a question about this error here, which led me to believe that there was a problem with my text not being tokenized (which sounded like it hit the nail on the head, since I was using a list of lists as input data, where each list contained one single unbroken string).
I cooked up a quick and dirty tokenizer to test this theory:
def my_tokenizer(X):
    newlist = []
    for alist in X:
        newlist.append(alist[0].split(' '))
    return newlist
which does what it is supposed to, but when I use it in the arguments to the CountVectorizer:
pipeline = Pipeline([ ('vect', CountVectorizer(tokenizer=my_tokenizer)), ...
I still get the same error as if nothing happened.
I did notice that I can circumvent the error by commenting out the CountVectorizer in my Pipeline. Which is strange... I didn't think you could use the TfidfTransformer() without first having a data structure to transform... in this case the matrix of counts.
Why do I keep getting this error? Actually, it would be nice to know what this error means! (Was lower called to convert the text to lowercase or something? I can't tell from reading the stack trace.) Am I misusing the Pipeline... or is the problem really an issue with the arguments to the CountVectorizer alone?
Any advice would be greatly appreciated.
-
tumultous_rooster over 8 years
Coincidentally, I based my code off this example. Since the example pulls its data from sklearn.datasets.fetch_20newsgroups, it is unclear what format that data is in (list? matrix?). The documentation isn't very helpful on this detail either.
-
Ibraim Ganiev over 8 years
@MattO'Brien Yep, I can only recommend using the IPython console or Jupyter notebooks (or simply the standard Python interpreter / debugger, if you don't want to install additional software) to see intermediate results. It helps a lot in understanding such small details.
-
tumultous_rooster over 8 years
I do use IPython notebook, but I merely read the example and modified it for my own purposes. I didn't actually execute the original example, assuming that the input was a list of lists. I should have done my due diligence.