scikit-learn TfidfVectorizer meaning?

machine-learning nlp scikit-learn feature-extraction document-classification

28,017

Solution 1

TfidfVectorizer - Transforms text to feature vectors that can be used as input to estimator.

vocabulary_ Is a dictionary that converts each token (word) to feature index in the matrix, each unique token gets a feature index.

What is?(e.g.: u'me': 8 )

It tells you that the token 'me' is represented as feature number 8 in the output matrix.

is this a matrix or just a vector?

Each sentence is a vector, the sentences you've entered are matrix with 3 vectors. In each vector the numbers (weights) represent features tf-idf score. For example: 'julie': 4 --> Tells you that the in each sentence 'Julie' appears you will have non-zero (tf-idf) weight. As you can see in the 2'nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

The 5'th element scored 0.51785612 - the tf-idf score for 'Julie'. For more info about Tf-Idf scoring read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf

Solution 2

So tf-idf creates a set of its own vocabulary from the entire set of documents. Which is seen in first line of output. (for better understanding I have sorted it)

{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6,  u'loves': 7, u'me': 8, u'more': 9, u'than': 10, }

And when the document is parsed to get its tf-idf. Document:

He watches basketball and baseball

and its output,

[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]

is equivalent to,

[baseball basketball he jane julie likes linda loves me more than]

Since our document has only these words: baseball, basketball, he, from the vocabulary created. The document vector output has values of tf-idf for only these three words and in the same sorted vocabulary position.

tf-idf is used to classify documents, ranking in search engine. tf: term frequency(count of the words present in document from its own vocabulary), idf: inverse document frequency(importance of the word to each document).

Solution 3

The method addresses the fact that all words should not be weighted equally, using the weights to indicate the words that are most unique to the document, and best used to characterize it.

new_docs = ['basketball baseball', 'basketball baseball', 'basketball baseball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print (vectorizer.vocabulary_)
print ((new_term_freq_matrix.todense()))


{'basketball': 1, 'baseball': 0}
    [[ 0.70710678  0.70710678]
     [ 0.70710678  0.70710678]
     [ 0.70710678  0.70710678]]

new_docs = ['basketball baseball', 'basketball basketball', 'basketball basketball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print (vectorizer.vocabulary_)
print ((new_term_freq_matrix.todense()))

{'basketball': 1, 'baseball': 0}
    [[ 0.861037    0.50854232]
     [ 0.          1.        ]
     [ 0.          1.        ]] 

new_docs = ['basketball basketball baseball', 'basketball basketball', 'basketball 
basketball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print (vectorizer.vocabulary_)
print ((new_term_freq_matrix.todense())) 


{'basketball': 1, 'baseball': 0}
[[ 0.64612892  0.76322829]
[ 0.          1.        ]
[ 0.          1.        ]]

28,017

anon

Updated on January 24, 2020

Comments

anon over 4 years

I was reading about TfidfVectorizer implementation of scikit-learn, i don´t understand what´s the output of the method, for example:

new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()

output:

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]

What is?(e.g.: u'me': 8 ):

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

is this a matrix or just a vector?, i can´t understand what´s telling me the output:

[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]

Could anybody explain me in more detail these outputs?

Thanks!

BluePython almost 8 years

what is the u parameter in the output? Using a fresh download of Anaconda/Scikit and it is not showing up. Is it now not displayed in the output?
BluePython almost 8 years

FYI - it is the difference between unicode or not (which is specified on versions before Python 3).
Akash Kandpal almost 6 years

this one explains better. Thanks, mate.