Find the tf-idf score of specific words in documents using sklearn
Solution 1
Yes. See .vocabulary_ on your fitted/transformed TF-IDF vectorizer.
In [1]: from sklearn.datasets import fetch_20newsgroups
In [2]: data = fetch_20newsgroups(categories=['rec.autos'])
In [3]: from sklearn.feature_extraction.text import TfidfVectorizer
In [4]: cv = TfidfVectorizer()
In [5]: X = cv.fit_transform(data.data)
In [6]: cv.vocabulary_
It is a dictionary of the form:
{word : column index in array}
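With that mapping, the score of one term in one document is a single lookup in the sparse matrix. The following is a minimal sketch, assuming a tiny made-up corpus in place of the newsgroups data; the names `docs` and `tfidf_score` are illustrative and not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (an assumption; stands in for data.data above)
docs = ["the car engine is loud", "the car is fast", "my engine needs oil"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)


def tfidf_score(word, doc_index):
    """Return the TF-IDF score of `word` in document `doc_index` (0.0 if absent)."""
    # The vectorizer lowercases text by default, so look up the lowercase form
    col = vectorizer.vocabulary_.get(word.lower())
    if col is None:  # word never seen during fit
        return 0.0
    return float(X[doc_index, col])


print(tfidf_score("engine", 0))  # positive: "engine" occurs in document 0
print(tfidf_score("engine", 1))  # 0.0: "engine" does not occur in document 1
print(tfidf_score("truck", 0))   # 0.0: "truck" is not in the vocabulary
```

Returning 0.0 for unknown words is a design choice of this sketch; raising a KeyError instead may be preferable if a missing term should be treated as an error.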
Solution 2
Here is another approach, using CountVectorizer and TfidfTransformer. Note that idf_ holds the inverse document frequency (IDF) weight of each word across the whole corpus, not a per-document TF-IDF score:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# our corpus
data = ['I like dog', 'I love cat', 'I interested in cat']
cv = CountVectorizer()
# convert text data into term-frequency matrix
data = cv.fit_transform(data)
tfidf_transformer = TfidfTransformer()
# convert term-frequency matrix into tf-idf
tfidf_matrix = tfidf_transformer.fit_transform(data)
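# map each word to its IDF weight (one value per term for the whole corpus)
# note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word2idf = dict(zip(cv.get_feature_names_out(), tfidf_transformer.idf_))
for word, score in word2idf.items():
    print(word, score)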
Output (as originally posted, in Python 2 formatting):
(u'love', 1.6931471805599454)
(u'like', 1.6931471805599454)
(u'i', 1.0)
(u'dog', 1.6931471805599454)
(u'cat', 1.2876820724517808)
(u'interested', 1.6931471805599454)
(u'in', 1.6931471805599454)
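The values in idf_ are the same for every document. To get the TF-IDF score of a word in one particular document, read the corresponding cell of the transformed matrix instead. A minimal sketch, assuming the same three-sentence corpus; the names `corpus`, `counts`, `doc_index`, and `doc_scores` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['I like dog', 'I love cat', 'I interested in cat']

cv = CountVectorizer()
counts = cv.fit_transform(corpus)                        # term counts per document
tfidf_matrix = TfidfTransformer().fit_transform(counts)  # TF-IDF, L2-normalized rows

# Score of a single word in a single document
word, doc_index = 'cat', 1
col = cv.vocabulary_[word]            # column index of the word
score = float(tfidf_matrix[doc_index, col])
print(word, doc_index, score)

# All non-zero term scores for that document
row = tfidf_matrix[doc_index].toarray().ravel()
doc_scores = {w: float(row[j]) for w, j in cv.vocabulary_.items() if row[j] > 0}
print(doc_scores)
```

Because TfidfTransformer normalizes each row to unit length by default, these scores are not simply count times IDF; pass norm=None to the transformer if the raw products are wanted.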
Author: WhiteTiger
Updated on June 14, 2022
Comments
WhiteTiger almost 2 years
I have code that runs basic TF-IDF vectorizer on a collection of documents, returning a sparse matrix of D X F where D is the number of documents and F is the number of terms. No problem.
But how do I find the TF-IDF score of a specific term in the document? i.e. is there some sort of dictionary between terms (in their textual representation) and their position in the resulting sparse matrix?