Creating a TF-IDF Matrix Python 3.6

python python-3.x matrix information-retrieval tf-idf

10,900

Your code works fine. Here is an example with two sentences, where each sentence is treated as one document. Hopefully it helps.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend", 
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
print(matrix)

fit_transform() returns a tf-idf-weighted document-term matrix, stored as a SciPy sparse matrix.

The print() statement outputs the following:

  (0, 2)    0.379303492809
  (0, 6)    0.379303492809
  (0, 7)    0.379303492809
  (0, 8)    0.533097824526
  (0, 9)    0.533097824526
  (1, 3)    0.342619853089
  (1, 5)    0.342619853089
  (1, 4)    0.342619853089
  (1, 0)    0.342619853089
  (1, 11)   0.342619853089
  (1, 10)   0.342619853089
  (1, 1)    0.342619853089
  (1, 2)    0.243776847332
  (1, 6)    0.243776847332
  (1, 7)    0.243776847332

So, how do we interpret this matrix? Each row shows a tuple (x, y) and a value. The tuple gives the document number (here, the sentence number) and the feature number; the value is the tf-idf weight. Only non-zero entries are listed.

To understand this better, let's print the list of features (in our case, the features are words) and their indices.

# On scikit-learn older than 1.0, call vectorizer.get_feature_names() instead.
for i, feature in enumerate(vectorizer.get_feature_names_out()):
    print(i, feature)

It outputs:

0 can
1 don
2 friend
3 from
4 get
5 help
6 my
7 stackoverflow
8 to
9 welcome
10 worry
11 you

So the sentence "welcome to stackoverflow my friend" is transformed to the following.

(0, 2)  0.379303492809
(0, 6)  0.379303492809
(0, 7)  0.379303492809
(0, 8)  0.533097824526
(0, 9)  0.533097824526

For example, the first two rows can be read as follows.

0 = sentence no.
2 = word index (index of the word `friend`)
0.379303492809 = tf-idf weight

0 = sentence no.
6 = word index (index of the word `my`)
0.379303492809 = tf-idf weight
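
The manual decoding above can also be done for every entry at once. The following is a minimal sketch, assuming the same two-sentence corpus; the names index_to_word and decoded are introduced here for illustration only. It relies on the vectorizer's vocabulary_ attribute (word to column index) and on the COO view of the sparse matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> column index; invert it to look words up by column.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# COO format exposes parallel arrays of row indices, column indices and values.
coo = matrix.tocoo()
decoded = [(int(doc), index_to_word[int(col)], float(weight))
           for doc, col, weight in zip(coo.row, coo.col, coo.data)]

# One line per non-zero entry: sentence no., word, tf-idf weight.
for doc, word, weight in decoded:
    print(doc, word, round(weight, 4))
```

Each printed line corresponds to one (x, y) row of the sparse output, with the word index replaced by the word itself.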

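The tf-idf values show that welcome and to rank higher than the other words in the first sentence (document 0): both occur only in that sentence, so they receive a larger idf than friend, my and stackoverflow, which occur in both sentences.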
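
To see where these numbers come from, the weights for the first sentence can be recomputed by hand. This is a sketch under the assumption that the vectorizer uses scikit-learn's default settings (smoothed idf, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row); the variable names are illustrative.

```python
import math

n_docs = 2
# Document frequency of each word in sentence 0 (how many sentences contain it).
doc_freq = {"welcome": 1, "to": 1, "stackoverflow": 2, "my": 2, "friend": 2}

# Smoothed idf, as documented for scikit-learn's default smooth_idf=True.
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# Each word occurs once in sentence 0, so tf = 1 and the raw weight equals idf.
# Dividing by the Euclidean norm gives the L2-normalised row.
norm = math.sqrt(sum(v * v for v in idf.values()))
weights = {w: v / norm for w, v in idf.items()}

for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(word, round(weight, 6))
```

Under these assumptions, welcome and to come out at about 0.5331 and the three shared words at about 0.3793, matching the matrix printed earlier.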

You can extend this example to look up the rank of a given word in a particular sentence or document.
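
One possible way to do that is sketched below. The helper word_rank is hypothetical (it is not part of scikit-learn), and it assumes that words with equal weight share the same rank.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)


def word_rank(word, doc_index):
    """Return the 1-based rank of `word` in a document, or None if absent.

    Hypothetical helper: words with equal tf-idf weight share a rank.
    """
    # The default vectorizer lowercases text, so lowercase the query too.
    col = vectorizer.vocabulary_.get(word.lower())
    if col is None:
        return None  # word never seen in the corpus
    row = matrix[doc_index].toarray().ravel()  # dense weights for one document
    if row[col] == 0:
        return None  # word is in the vocabulary but not in this document
    # Rank = 1 + number of words in the document with a strictly higher weight.
    return int((row > row[col]).sum()) + 1


print(word_rank("welcome", 0))
print(word_rank("friend", 0))
print(word_rank("help", 0))
```

For a large corpus it may be preferable to work on the sparse row directly rather than calling toarray(), but the dense version keeps the sketch short.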


Author by

Siddharth Sharma

Updated on June 27, 2022

Comments

Siddharth Sharma almost 2 years

I have 100 documents (each document is a simple list of the words in that document). Now I want to create a TF-IDF matrix so that I can build a small word search by rank. I tried using TfidfVectorizer but got lost in the syntax. Any help would be much appreciated. Regards.

Edit: I converted the lists into strings and added them into a parent list:

vectorizer = TfidfVectorizer(vocabulary=word_set)
matrix = vectorizer.fit_transform(doc_strings)
print(matrix)

Here word_set is the set of possible distinct words and doc_strings is a list that contains each document as a string. However, when I print the matrix I get the output below:

  (0, 839)  0.299458532286
  (0, 710)  0.420878518454
  (0, 666)  0.210439259227
  (0, 646)  0.149729266143
  (0, 550)  0.210439259227
  (0, 549)  0.210439259227
  (0, 508)  0.210439259227
  (0, 492)  0.149729266143
  (0, 479)  0.149729266143
  (0, 425)  0.149729266143
  (0, 401)  0.210439259227
  (0, 332)  0.210439259227
  (0, 310)  0.210439259227
  (0, 253)  0.149729266143
  (0, 216)  0.210439259227
  (0, 176)  0.149729266143
  (0, 122)  0.149729266143
  (0, 119)  0.210439259227
  (0, 111)  0.149729266143
  (0, 46)   0.210439259227
  (0, 26)   0.210439259227
  (0, 11)   0.149729266143
  (0, 0)    0.210439259227
  (1, 843)  0.0144007295367
  (1, 842)  0.0288014590734
  (1, 25)   0.0144007295367
  (1, 24)   0.0144007295367
  (1, 23)   0.0432021886101
  (1, 22)   0.0144007295367
  (1, 21)   0.0288014590734
  (1, 20)   0.0288014590734
  (1, 19)   0.0288014590734
  (1, 18)   0.0432021886101
  (1, 17)   0.0288014590734
  (1, 16)   0.0144007295367
  (1, 15)   0.0144007295367
  (1, 14)   0.0432021886101
  (1, 13)   0.0288014590734
  (1, 12)   0.0144007295367
  (1, 11)   0.0102462376715
  (1, 10)   0.0144007295367
  (1, 9)    0.0288014590734
  (1, 8)    0.0288014590734
  (1, 7)    0.0144007295367
  (1, 6)    0.0144007295367
  (1, 5)    0.0144007295367
  (1, 4)    0.0144007295367
  (1, 3)    0.0144007295367
  (1, 2)    0.0288014590734
  (1, 1)    0.0144007295367

Is this correct, and if so, how can I search for the rank of a given word in a particular document?
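
For reference, the setup described in this question can be sketched roughly as follows. The two short token lists in docs are a hypothetical stand-in for the 100 documents; doc_strings and word_set are built the way the edit describes. Note that with default settings the vectorizer lowercases text and ignores single-character tokens, so vocabulary entries that are uppercase or one character long would never be matched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the 100 tokenised documents.
docs = [["welcome", "to", "stackoverflow"],
        ["stackoverflow", "can", "help", "you"]]

# Join each token list into one string per document.
doc_strings = [" ".join(tokens) for tokens in docs]
# Collect the distinct words across all documents.
word_set = {word for tokens in docs for word in tokens}

# Sorting makes the column order explicit instead of relying on set ordering.
vectorizer = TfidfVectorizer(vocabulary=sorted(word_set))
matrix = vectorizer.fit_transform(doc_strings)

print(matrix.shape)  # (number of documents, number of distinct words)
```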