Attribute error while using scikit-learn


Since I'm running the development (pre-0.14) version, where the feature_extraction.text module got overhauled, I don't get the same error message. But I suspect you can solve this issue with:

vectorizer = CountVectorizer(stop_words=stopWords, min_df=1)

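The min_df parameter causes CountVectorizer to throw away any term that occurs in fewer than min_df documents (because such rare terms have little predictive value). In your version it defaults to 2. Once the stop words are removed, every remaining term in your two training documents occurs in only one of them, so all your terms get thrown away and you end up with an empty vocabulary.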
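To make the effect concrete, here is a minimal pure-Python sketch of document-frequency pruning. It is not scikit-learn's implementation: `build_vocabulary` is a hypothetical helper, and the three-word stop list is a tiny stand-in for NLTK's English list. The token pattern keeps words of two or more characters, which mirrors CountVectorizer's default.

```python
import re
from collections import Counter


def build_vocabulary(docs, stop_words, min_df):
    """Hypothetical helper: map each kept term to a column index."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency).
        tokens = set(re.findall(r"\b\w\w+\b", doc.lower())) - set(stop_words)
        doc_freq.update(tokens)
    # Drop every term that appears in fewer than min_df documents.
    kept = sorted(term for term, n in doc_freq.items() if n >= min_df)
    return {term: index for index, term in enumerate(kept)}


train_set = ["The sky is blue.", "The sun is bright."]
stop_words = {"the", "is", "in"}  # assumption: stand-in for stopwords.words('english')

vocab_min_df_2 = build_vocabulary(train_set, stop_words, min_df=2)
vocab_min_df_1 = build_vocabulary(train_set, stop_words, min_df=1)

print(vocab_min_df_2)  # {}
print(vocab_min_df_1)  # {'blue': 0, 'bright': 1, 'sky': 2, 'sun': 3}
```

With min_df=2 nothing survives, because "sky", "blue", "sun" and "bright" each appear in exactly one document; with min_df=1 all four are kept.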


Author: Animesh Pandey

LinkedIn Profile : http://www.linkedin.com/in/animeshpandey Github Profile : https://github.com/apanimesh061

Updated on August 20, 2022

Comments

  • Animesh Pandey
    Animesh Pandey over 1 year

    I am trying to find similar questions with scikit-learn using cosine similarity. I was trying this sample code available on the internet (Link1 and Link2):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA
    
    train_set = ["The sky is blue.", "The sun is bright."]
    test_set = ["The sun in the sky is bright."]
    stopWords = stopwords.words('english')
    
    vectorizer = CountVectorizer(stop_words = stopWords)
    transformer = TfidfTransformer()
    
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print 'Fit Vectorizer to train set', trainVectorizerArray
    print 'Transform Vectorizer to test set', testVectorizerArray
    cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
    
    for vector in trainVectorizerArray:
        print vector
        for testV in testVectorizerArray:
            print testV
            cosine = cx(vector, testV)
            print cosine
    
    transformer.fit(trainVectorizerArray)
    print transformer.transform(trainVectorizerArray).toarray()
    
    transformer.fit(testVectorizerArray)
    tfidf = transformer.transform(testVectorizerArray)
    print tfidf.todense()
    

    I always get this error:

    Traceback (most recent call last):
    File "C:\Users\Animesh\Desktop\NLP\ngrams2.py", line 14, in <module>
    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    File "C:\Python27\lib\site-packages\scikit_learn-0.13.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 740, in fit_transform
    raise ValueError("empty vocabulary; training set may have"
    ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).
    

    I even checked the code available on this link. There I got the error AttributeError: 'CountVectorizer' object has no attribute 'vocabulary'.

    How can I solve this issue?

    I am using Python 2.7.3 on Windows 7 32 Bit and scikit_learn 0.13.1.

  • Animesh Pandey
    Animesh Pandey about 11 years
    Oh! That solved the issue. But could you tell me what's going on with the vocabulary function? Why does it give an attribute error when I try to use it?
  • Fred Foo
    Fred Foo about 11 years
    @AnimeshPandey: it's right there in the error message: "empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low)." As I explained, the default setting min_df=2 is too high because you only have two documents. (Mind you, tf-idf doesn't work properly with so few documents either.)
  • ogrisel
    ogrisel about 11 years
    vocabulary_ with a trailing _ is extracted when calling the fit method (unless provided as a constructor parameter by the user). See the documentation.
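
To illustrate the last comment, here is a short sketch that assumes a recent scikit-learn release on Python 3 rather than the 0.13.1 setup from the question. The three-word stop list is an assumption standing in for NLTK's English list, and min_df=1 is passed explicitly so the result does not depend on the version's default.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]
stop_words = ["the", "is", "in"]  # assumption: stand-in for stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stop_words, min_df=1)

# The learned mapping does not exist until the vectorizer has been fitted.
fitted_before = hasattr(vectorizer, "vocabulary_")

vectorizer.fit(train_set)

# After fitting, vocabulary_ is a dict mapping each kept term to its column index.
learned_terms = sorted(vectorizer.vocabulary_)

print(fitted_before)
print(learned_terms)
```

The trailing underscore follows scikit-learn's convention for attributes that are learned during fit, which is why the name differs from the constructor parameter.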