Attribute error while using scikit-learn
Since I'm running the development (pre-0.14) version, where the feature_extraction.text module got overhauled, I don't get the same error message. But I suspect you can solve this issue with:
vectorizer = CountVectorizer(stop_words=stopWords, min_df=1)
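The min_df parameter causes CountVectorizer to throw away any term that occurs in too few documents (because it won't have any predictive value). By default (in the 0.13.1 release you are using), it's set to 2, so a term must occur in at least two documents to be kept. None of the non-stop-word terms in your two training sentences occurs in both, which means all your terms get thrown away and you end up with an empty vocabulary.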
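To make the effect of min_df concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not scikit-learn's actual implementation; the helper name filter_by_document_frequency and the pre-tokenized input are assumptions made purely for illustration.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df):
    """Keep only the terms that appear in at least `min_df` documents.

    Hypothetical helper for illustration; scikit-learn does this internally.
    """
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# The question's training set after lowercasing and stop-word removal
# (assumed tokenization: "the" and "is" dropped as stop words).
docs = [["sky", "blue"], ["sun", "bright"]]

vocab_min_df_2 = filter_by_document_frequency(docs, min_df=2)
vocab_min_df_1 = filter_by_document_frequency(docs, min_df=1)

print(vocab_min_df_2)  # no term occurs in both documents
print(vocab_min_df_1)  # every term occurs in at least one document
```

With min_df=2 nothing survives the filter, which is the "empty vocabulary" situation from the traceback; with min_df=1 all four terms are kept.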
Animesh Pandey
LinkedIn Profile : http://www.linkedin.com/in/animeshpandey Github Profile : https://github.com/apanimesh061
Updated on August 20, 2022
Comments
-
Animesh Pandey over 1 year
I am trying to find similar questions with scikit-learn, using cosine similarity. I tried this sample code, which is available on the internet (Link1 and Link2):
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The sun is bright."]
    test_set = ["The sun in the sky is bright."]
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words = stopWords)
    transformer = TfidfTransformer()

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print 'Fit Vectorizer to train set', trainVectorizerArray
    print 'Transform Vectorizer to test set', testVectorizerArray

    cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

    for vector in trainVectorizerArray:
        print vector
        for testV in testVectorizerArray:
            print testV
            cosine = cx(vector, testV)
            print cosine

    transformer.fit(trainVectorizerArray)
    print transformer.transform(trainVectorizerArray).toarray()

    transformer.fit(testVectorizerArray)
    tfidf = transformer.transform(testVectorizerArray)
    print tfidf.todense()
I always get this error:
    Traceback (most recent call last):
      File "C:\Users\Animesh\Desktop\NLP\ngrams2.py", line 14, in <module>
        trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
      File "C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 740, in fit_transform
        raise ValueError("empty vocabulary; training set may have"
    ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).
I also tried the code available at this link. There I got the error:
    AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'
How can I solve this issue?
I am using Python 2.7.3 on 32-bit Windows 7 with scikit-learn 0.13.1.
-
Animesh Pandey about 11 years
Oh! That solved the issue. But could you tell me what's up with the vocabulary function? Why does it give an attribute error when I try to use it?
-
Fred Foo about 11 years
@AnimeshPandey: it's right there in the error message: "empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low)." As I explained, the default setting min_df=2 is too high because you only have two documents. (Mind you, tf-idf doesn't work properly with so few documents either.)
-
ogrisel about 11 years
The attribute is called vocabulary_, with a trailing underscore. It is populated when the fit method is called (unless a vocabulary was provided by the user as a constructor parameter). See the documentation.
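For illustration, a short sketch of that behaviour on the question's training set. It assumes a reasonably recent scikit-learn and, to stay self-contained, swaps the NLTK stop-word list from the question for scikit-learn's built-in "english" list; exact behaviour on 0.13.1 may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]

# Assumption: built-in English stop words instead of the NLTK list.
vectorizer = CountVectorizer(stop_words="english", min_df=1)

# Before fitting, no learned vocabulary exists on the object.
fitted_before = hasattr(vectorizer, "vocabulary_")

vectorizer.fit(train_set)

# After fitting, vocabulary_ maps each kept term to its column index.
fitted_after = hasattr(vectorizer, "vocabulary_")
terms = sorted(vectorizer.vocabulary_)

print(fitted_before)
print(fitted_after)
print(terms)
```

The trailing underscore follows scikit-learn's convention for attributes that are learned from data during fit, as opposed to parameters passed to the constructor.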