Interpreting the sum of TF-IDF scores of words across documents


Solution 1

One way to interpret TF-IDF at the corpus level is to take, for each term, the highest TF-IDF score it reaches in any single document of the corpus.

Find the top words in corpus_tfidf:

    # Keep, for every word id, the highest TF-IDF score it reaches in any document.
    topWords = {}
    for doc in corpus_tfidf:
        for iWord, tf_idf in doc:
            if iWord not in topWords:
                topWords[iWord] = 0

            if tf_idf > topWords[iWord]:
                topWords[iWord] = tf_idf

    # Print the six words with the highest per-document TF-IDF score.
    for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
        print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
        if i == 6: break

Output comparison chart:
NOTE: Couldn't use gensim to create a matching dictionary with corpus_tfidf, so only word indices can be displayed.

Question tfidf_saliency   topWords(corpus_tfidf)  Other TF-IDF implementation  
---------------------------------------------------------------------------  
1: Word(7)   0.121        1: Word(13)    0.640    1: paths         0.376019  
2: Word(8)   0.111        2: Word(27)    0.632    2: intersection  0.376019  
3: Word(26)  0.108        3: Word(28)    0.632    3: survey        0.366204  
4: Word(29)  0.100        4: Word(8)     0.628    4: minors        0.366204  
5: Word(9)   0.090        5: Word(29)    0.628    5: binary        0.300815  
6: Word(14)  0.087        6: Word(11)    0.544    6: generation    0.300815  
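
As a side note, if the gensim `dictionary` from the question is in scope, the indices in the first two columns can be mapped back to tokens (the question shows, for example, that dictionary[7] is u'system'):

    for idx in (7, 8, 26):
        print(idx, dictionary[idx])   # 7 -> 'system', 8 -> 'survey', 26 -> 'graph' per the question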

The calculation of TF-IDF always takes the whole corpus into account.
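
A minimal sketch of that point (assuming gensim is installed): the same document receives different TF-IDF weights depending on which corpus the model was fitted on. The toy documents below are made up purely for illustration.

    from gensim import corpora, models

    toy_a = [["graph", "trees"], ["graph", "minors"], ["trees", "minors"]]
    toy_b = [["graph", "trees"], ["graph", "minors"], ["graph", "survey"]]

    for toy_texts in (toy_a, toy_b):
        dct = corpora.Dictionary(toy_texts)
        bow = [dct.doc2bow(t) for t in toy_texts]
        model = models.TfidfModel(bow)
        # The weight of "graph" in the first document changes with the corpus;
        # in toy_b it occurs in every document, so its IDF (and weight) drops to zero.
        print([(dct[i], round(w, 3)) for i, w in model[bow[0]]])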

Tested with Python 3.4.2

Solution 2

This is a great discussion. Thanks for starting this thread. The idea of including document length, from @avip, seems interesting. I will have to experiment and check the results. In the meantime, let me try asking the question a little differently: what are we trying to interpret when querying for TF-IDF relevance scores?

  1. Possibly trying to understand the word relevance at the document level
  2. Possibly trying to understand the word relevance per class
  3. Possibly trying to understand the word relevance overall (in the whole corpus)

     # counts: corpus of 6 documents, 3 features each
     counts = [[3, 0, 1],
               [2, 0, 0],
               [3, 0, 0],
               [4, 0, 0],
               [3, 2, 0],
               [3, 0, 2]]
     import numpy as np
     from sklearn.feature_extraction.text import TfidfTransformer
     transformer = TfidfTransformer(smooth_idf=False)
     tfidf = transformer.fit_transform(counts)
     print(tfidf.toarray())

     # lambdas for basic aggregate statistics (column-wise sum and mean)
     summarizer_default = lambda x: np.sum(x, axis=0)
     summarizer_mean = lambda x: np.mean(x, axis=0)

     print(summarizer_default(tfidf))
     print(summarizer_mean(tfidf))
    

Result:

# Result post computing TF-IDF relevance scores
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

# Result post aggregation (Sum, Mean) 
[[ 4.87420595  0.88089948  1.38675962]]
[[ 0.81236766  0.14681658  0.2311266 ]]

If we observe closely, we realize that feature 1, which occurs in all the documents, is not ignored completely, because the sklearn implementation uses idf = log[ n / df(d, t) ] + 1. The +1 is added so that an important word which just happens to occur in every document is not zeroed out, e.g. the word 'bike' occurring very frequently when classifying a particular document as 'motorcycle' (20_newsgroup dataset).
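
As a quick sanity check on that formula (a sketch that assumes the same `counts` matrix as above), computing idf = ln(n / df) + 1 by hand and L2-normalising the rows reproduces the transformer output:

     import numpy as np

     counts = np.array([[3, 0, 1],
                        [2, 0, 0],
                        [3, 0, 0],
                        [4, 0, 0],
                        [3, 2, 0],
                        [3, 0, 2]], dtype=float)

     n_docs = counts.shape[0]
     df = (counts > 0).sum(axis=0)     # document frequency per feature: [6, 1, 2]
     idf = np.log(n_docs / df) + 1     # a feature present in every document still gets idf = 1
     tfidf_manual = counts * idf
     # TfidfTransformer uses norm='l2' by default, so normalise each row
     tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)
     print(np.round(tfidf_manual, 8))  # matches tfidf.toarray() above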

  1. Now, regarding the first two questions: one is trying to interpret and understand the top common features occurring across the documents. In that case, aggregating in some form over all occurrences of a word in a document does not take anything away, even mathematically. IMO such a query is very useful for exploring the dataset and helping to understand what the dataset is about. The same logic can be applied to vectorizing with hashing as well.

    relevance_score = mean(tf(t, d) * idf(t, d)) = mean( (bias + initial_wt * F(t, d) / max{F(t', d)}) * (log(N / df(d, t)) + 1) )

  2. Question 3 is very important, as it might well contribute to which features get selected for building a predictive model. Using TF-IDF scores independently for feature selection can be misleading at multiple levels. Adopting a more theoretical statistical test such as chi2, coupled with the TF-IDF relevance scores, might be a better approach (see the sketch below); such a test also evaluates the importance of a feature in relation to its respective target class.
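
A minimal sketch of that chi2 idea; the tiny corpus and class labels below are purely hypothetical, and `get_feature_names_out` assumes scikit-learn >= 1.0:

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.feature_selection import SelectKBest, chi2

     docs = ["the bike engine roared", "new bike helmet laws",
             "stock prices fell", "market prices rallied"]
     labels = [0, 0, 1, 1]            # hypothetical target classes

     vec = TfidfVectorizer()
     X = vec.fit_transform(docs)      # TF-IDF features are non-negative, so chi2 is applicable
     selector = SelectKBest(chi2, k=3).fit(X, labels)
     print(vec.get_feature_names_out()[selector.get_support()])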

And of course, combining such an interpretation with the model's learned feature weights would be very helpful in understanding the importance of text-derived features completely.

The problem is a little more complex to cover in detail here, but hopefully the above helps. What do others think?

Reference: https://arxiv.org/abs/1707.05261

Solution 3

There are two contexts in which saliency can be calculated:

  1. saliency in the corpus
  2. saliency in a single document

Saliency in the corpus can simply be calculated by counting the appearances of a particular word in the corpus, or by the inverse of the count of documents in which that word appears (IDF = Inverse Document Frequency), because words that carry specific meaning do not appear everywhere.

Saliency in a document is calculated by TF-IDF, because it is composed of two kinds of information: global information (corpus-based) and local information (document-based). Claiming that "the word with the larger in-document frequency is more important in the current document" is not completely true or false, because it depends on the global saliency of the word. In a particular document you have many words like "it", "is", "am", "are", ... with large frequencies, but these words are not important in any document and you can treat them as stop words!

---- edit ---

The denominator (= len(corpus_tfidf)) is a constant value and can be ignored if you want to deal with the ordinality rather than the cardinality of the measurement. On the other hand, we know that IDF means Inverse Document Frequency, so IDF can be represented by 1/DF. We also know that DF is a corpus-level value and TF is a document-level value. The TF-IDF summation therefore turns document-level TF into corpus-level TF. Indeed, the summation is equal to this formula:

count(word) / count(documents containing word)

This measurement could be called an inverse-scattering value. When the value goes up, the word is concentrated in a smaller subset of documents, and vice versa.
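
A small sketch of that ratio, assuming the tokenised `texts` list from the question is in scope:

    from collections import Counter

    term_count = Counter(w for doc in texts for w in doc)       # corpus-level term frequency
    doc_freq   = Counter(w for doc in texts for w in set(doc))  # number of documents containing the word

    inverse_scattering = {w: term_count[w] / doc_freq[w] for w in term_count}
    print(sorted(inverse_scattering.items(), key=lambda kv: kv[1], reverse=True)[:5])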

I believe that this formula is not so useful.

Comments

  • alvas

    First let's extract the TF-IDF scores per term per document:

    from gensim import corpora, models, similarities
    documents = ["Human machine interface for lab abc computer applications",
                  "A survey of user opinion of computer system response time",
                  "The EPS user interface management system",
                  "System and human system engineering testing of EPS",
                  "Relation of user perceived response time to error measurement",
                  "The generation of random binary unordered trees",
                  "The intersection graph of paths in trees",
                  "Graph minors IV Widths of trees and well quasi ordering",
                  "Graph minors A survey"]
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    

    Printing it out:

    for doc in corpus_tfidf:
        print(doc)
    

    [out]:

    [(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
    [(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
    [(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
    [(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
    [(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
    [(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
    [(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
    [(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
    [(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
    

    If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the TF-IDF scores across all documents and divide by the number of documents? I.e.

    >>> from collections import Counter
    >>> tfidf_saliency = Counter()
    >>> for doc in corpus_tfidf:
    ...     for word, score in doc:
    ...         tfidf_saliency[word] += score / len(corpus_tfidf)
    ... 
    >>> tfidf_saliency
    Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})
    

    Looking at the output, could we assume that the most "prominent" words in the corpus are:

    >>> dictionary[7]
    u'system'
    >>> dictionary[8]
    u'survey'
    >>> dictionary[26]
    u'graph'
    

    If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?