How is the TFIDFVectorizer in scikit-learn supposed to work?


Solution 1

From scikit-learn documentation:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.

As you can see, TfidfVectorizer is a CountVectorizer followed by TfidfTransformer.

What you are probably looking for is TfidfTransformer, not TfidfVectorizer.
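For illustration, here is a minimal sketch (toy corpus, default parameters) showing that chaining the two classes produces the same matrix as the combined class:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

# CountVectorizer followed by TfidfTransformer...
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# ...gives the same matrix as TfidfVectorizer in one step
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True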

Solution 2

I believe your issue lies in using different stopword lists: scikit-learn and NLTK use different lists by default. With scikit-learn it is usually a good idea to pass an explicit stop_words list to TfidfVectorizer, e.g.:

from sklearn.feature_extraction.text import TfidfVectorizer

my_stopword_list = ['and', 'to', 'the', 'of']
my_vectorizer = TfidfVectorizer(stop_words=my_stopword_list)

Doc page for the TfidfVectorizer class: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
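If you would rather reuse NLTK's stopword list than maintain your own, you can pass it straight to TfidfVectorizer; a sketch, assuming you have run nltk.download('stopwords') once:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# NLTK's English list differs from scikit-learn's built-in 'english' list
nltk_stop_words = stopwords.words('english')
my_vectorizer = TfidfVectorizer(stop_words=nltk_stop_words)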

Solution 3

Using the code below, I get much better results:

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')

Output

sustain    0.045090
bone       0.045090
thou       0.044417
thee       0.043673
timely     0.043269
thy        0.042731
prime      0.041628
absence    0.041234
rib        0.041234
feel       0.040259
Name: Adam, dtype: float64

and

thee          0.071188
thy           0.070549
forbids       0.069358
thou          0.068068
early         0.064642
earliest      0.062229
dreamed       0.062229
firmness      0.062229
glistering    0.062229
sweet         0.060770
Name: Eve, dtype: float64

Solution 4

I'm not sure why it's not the default, but you probably want sublinear_tf=True in the initialization for TfidfVectorizer. I forked your repo and sent you a PR with an example that probably looks more like what you want.
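For context, sublinear_tf=True replaces the raw term count tf with 1 + log(tf), which damps the weight of words that are extremely frequent within a single document. A rough sketch of the effect (toy counts, natural log as used by scikit-learn):

import numpy as np

raw_tf = np.array([1, 10, 100])    # raw term counts within one document
damped_tf = 1 + np.log(raw_tf)     # what sublinear_tf=True uses instead
print(damped_tf)                   # approximately [1.0, 3.30, 5.61]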

Author: Jonathan

Updated on November 22, 2020

Comments

  • Jonathan
    Jonathan over 3 years

    I'm trying to get words that are distinctive of certain documents using the TfidfVectorizer class in scikit-learn. It creates a tf-idf matrix with all the words and their scores in all the documents, but it seems to score common words highly as well. This is some of the code I'm running:

    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd

    # contents: list of document strings; characters: matching list of document names
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(contents)
    feature_names = vectorizer.get_feature_names()
    dense = tfidf_matrix.todense()
    denselist = dense.tolist()
    df = pd.DataFrame(denselist, columns=feature_names, index=characters)
    s = pd.Series(df.loc['Adam'])
    s[s > 0].sort_values(ascending=False)[:10]
    

    I expected this to return a list of distinctive words for the document 'Adam', but what it does is return a list of common words:

    and     0.497077
    to      0.387147
    the     0.316648
    of      0.298724
    in      0.186404
    with    0.144583
    his     0.140998
    

    I might not understand it perfectly, but as I understand it, tf-idf is supposed to find words that are distinctive of one document in a corpus: words that appear frequently in that document but not in the others. Here, 'and' appears frequently in the other documents too, so I don't know why it's getting such a high value here.

    The complete code I'm using to generate this is in this Jupyter notebook.

    When I compute tf-idfs semi-manually, using NLTK and computing scores for each word, I get the appropriate results. For the 'Adam' document:

    fresh        0.000813
    prime        0.000813
    bone         0.000677
    relate       0.000677
    blame        0.000677
    enough       0.000677
    

    That looks about right, since these are words that appear in the 'Adam' document, but not as much in other documents in the corpus. The complete code used to generate this is in this Jupyter notebook.

    Am I doing something wrong with the scikit code? Is there another way to initialize this class where it returns the right results? Of course, I can ignore stopwords by passing stop_words = 'english', but that doesn't really solve the problem, since common words of any sort shouldn't have high scores here.

  • Jonathan
    Jonathan about 8 years
    TfidfTransformer will transform the output of CountVectorizer, so I can run CountVectorizer and then run TfidfTransformer, but that's the same as running TfidfVectorizer. So I'm not convinced I need TfidfTransformer, if I'm going to have to run CountVectorizer first anyway. Won't it return the same results?
  • Jonathan
    Jonathan about 8 years
    That's good to know, but I guess I'm confused about why one needs to remove stopwords to begin with. If 'and' or 'the' occurs frequently in all documents, let's say, then why would it have a high tf-idf value? It seems to me that the point of tf-idf is to adjust for the term's frequency across all documents, so that terms that occur frequently across the corpus won't appear at the top of the list.
  • Rabbit
    Rabbit about 8 years
    @Jono, I guess your intuition is that TFIDF should benefit rare terms. This is half true. TFIDF takes into account two main things: TF, which is the term frequency in the document, and IDF, the inverse document frequency, which measures how rare the term is across the whole set of documents. TF benefits frequent terms, while IDF benefits rare terms. These two are almost opposing measures, which makes TFIDF a balanced metric.
  • Rabbit
    Rabbit about 8 years
    Also, stopword removal is a very common practice when using a vector-space representation. We can reason this way: for most applications, you want a metric that is high for important terms and low/zero for non-important ones. If your representation (TFIDF in this case) fails to do that, you counter this by removing terms that will not help and may even hurt your model.
  • Jonathan
    Jonathan about 8 years
    Awesome. That's a big improvement. But if you run it with a smaller set of characters, instead of all the characters, you get lists of commonly-used words again: github.com/JonathanReeve/milton-analysis/blob/v0.2/… "And," "to," "the," and "of" are the words with the highest tf-idfs for Adam and Eve, but those are words that appear frequently across the corpus, so I don't know why they're getting high tf-idf scores here.
  • fnl
    fnl about 8 years
    Because you are now using far fewer documents. The IDF depends on the number of documents the term appears in (i.e., it's a per-document count), so with just four documents it can't get very large (the document frequency is <= 4 for any term) and you don't have enough "statistical power" (see the sketch after these comments).
  • realmq
    realmq almost 6 years
    @Jono, how come I get a different result running the same code? The only code difference is "vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')"; then I seem to get much more reasonable output for Adam: sustain 0.045090, bone 0.045090, thou 0.044417, thee 0.043673, timely 0.043269, thy 0.042731, prime 0.041628, absence 0.041234, rib 0.041234, feel 0.040259
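To make the point about small corpora concrete, here is a small sketch (toy documents, scikit-learn's default smooth_idf=True, where idf = ln((1 + n) / (1 + df)) + 1). With only four documents, the idf of a word that appears in every document, like 'and', is 1.0, while the idf of a word unique to one document is only about 1.92, so the high term frequency of 'and' easily dominates the final score:

from sklearn.feature_extraction.text import TfidfVectorizer

# Four tiny documents; 'and' appears in all of them, 'rib' in only one
docs = [
    "and the rib and the bone",
    "and thou art sweet",
    "and thee and thy firmness",
    "and early dreamed",
]
vec = TfidfVectorizer()
vec.fit(docs)

# get_feature_names() on older scikit-learn versions
idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
print(idf['and'])  # 1.0   (appears in all 4 documents, but never reaches 0)
print(idf['rib'])  # ~1.92 (appears in 1 of 4 documents)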