How to get tf-idf with a pandas DataFrame?


Solution 1

The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
# fit the vocabulary and compute the tf-idf matrix in one step
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify; see the scikit-learn documentation for TfidfVectorizer.

The output of fit_transform is a sparse matrix; if you want to visualize it, you can call x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])
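
If you want labelled output, you can wrap the dense matrix in a DataFrame whose columns are the vectorizer's vocabulary. A minimal sketch, assuming the df and the fitted v from above (get_feature_names_out requires scikit-learn >= 1.0; older releases use get_feature_names):

import pandas as pd

# one row per document, one column per vocabulary term
tfidf_df = pd.DataFrame(x.toarray(),
                        columns=v.get_feature_names_out(),
                        index=df['docId'])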

Solution 2

A simple solution is to use texthero:

import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])

In [5]: df.head()
Out[5]:
   docId                         sent                                              tfidf
0      1   This is the first sentence  [0.3816141458138271, 0.6461289150464732, 0.381...
1      2  This is the second sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
2      3   This is the third sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
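
If you would rather not add a dependency, the same per-row vector column can be built with the TfidfVectorizer from Solution 1. A sketch, assuming the df from the question:

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
# store each row's dense tf-idf vector as a list entry, mirroring texthero's output column
df['tfidf'] = list(x.toarray())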

Comments

  • user1610952 (almost 2 years ago)

    I want to calculate tf-idf for the documents below. I'm using Python and pandas.

    import pandas as pd
    df = pd.DataFrame({'docId': [1, 2, 3],
                       'sent': ['This is the first sentence', 'This is the second sentence', 'This is the third sentence']})
    

    First, I thought I would need a word count for each row, so I wrote a simple function:

    def word_count(sent):
        word2cnt = dict()
        for word in sent.split():
            if word in word2cnt: word2cnt[word] += 1
            else: word2cnt[word] = 1
        return word2cnt
    

    Then I applied it to each row:

    df['word_count'] = df['sent'].apply(word_count)
    

    But now I'm lost. I know there's an easy method to calculate tf-idf if I use Graphlab, but I want to stick with an open source option. Both Sklearn and gensim look overwhelming. What's the simplest solution to get tf-idf?

  • Clock Slave (about 7 years ago)
    Let's say I passed 100 to the max_features parameter and the original vocabulary of the corpus is 1000. How do I get the names of the selected features and map them to the produced matrix?
  • arthur (about 7 years ago)
    v.get_feature_names() will give you the list of feature names. v.vocabulary_ will give you a dict with feature names as keys and their column index in the produced matrix as values. (See the sketch after these comments.)
  • Ch HaXam (over 6 years ago)
    Yes, but beware of printing feature_names(): if the number of features grows large, you will run into memory issues.
  • user1098761 (over 3 years ago)
    It might be the best and simplest way.
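
Following up on the max_features discussion in the comments above, a minimal sketch of mapping the selected features back to the matrix (the corpus and the max_features value are placeholders, not from the question; get_feature_names_out requires scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first sentence',
          'This is the second sentence',
          'This is the third sentence']

# keep only the 3 most frequent terms across the corpus (placeholder value)
v = TfidfVectorizer(max_features=3)
x = v.fit_transform(corpus)

print(v.get_feature_names_out())  # selected feature names, in column order
print(v.vocabulary_)              # {term: column index in x}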