Interpreting the sum of TF-IDF scores of words across documents


Solution 1

One way to interpret TF-IDF at the corpus level is to take, for each term, the highest TF-IDF score it reaches in any single document of the corpus.

Find the top words in corpus_tfidf:

    # Keep, for every word id, the highest TF-IDF score it reaches in any document.
    topWords = {}
    for doc in corpus_tfidf:
        for iWord, tf_idf in doc:
            if iWord not in topWords:
                topWords[iWord] = 0

            if tf_idf > topWords[iWord]:
                topWords[iWord] = tf_idf

    # Print the six words with the highest per-document TF-IDF score.
    for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
        print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
        if i == 6: break

Output comparison chart:
NOTE: Couldn't use gensim to create a matching dictionary with corpus_tfidf, so only word indices can be displayed.

Question tfidf_saliency   topWords(corpus_tfidf)  Other TF-IDF implementation  
---------------------------------------------------------------------------  
1: Word(7)   0.121        1: Word(13)    0.640    1: paths         0.376019  
2: Word(8)   0.111        2: Word(27)    0.632    2: intersection  0.376019  
3: Word(26)  0.108        3: Word(28)    0.632    3: survey        0.366204  
4: Word(29)  0.100        4: Word(8)     0.628    4: minors        0.366204  
5: Word(9)   0.090        5: Word(29)    0.628    5: binary        0.300815  
6: Word(14)  0.087        6: Word(11)    0.544    6: generation    0.300815  
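
As a side note, if the gensim `dictionary` from the question is in scope, the indices in the first two columns can be mapped back to tokens (the question shows, for example, that dictionary[7] is u'system'):

    for idx in (7, 8, 26):
        print(idx, dictionary[idx])   # 7 -> 'system', 8 -> 'survey', 26 -> 'graph' per the question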

The calculation of TF-IDF always takes the whole corpus into account.
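
A minimal sketch of that point (assuming gensim is installed): the same document receives different TF-IDF weights depending on which corpus the model was fitted on. The toy documents below are made up purely for illustration.

    from gensim import corpora, models

    toy_a = [["graph", "trees"], ["graph", "minors"], ["trees", "minors"]]
    toy_b = [["graph", "trees"], ["graph", "minors"], ["graph", "survey"]]

    for toy_texts in (toy_a, toy_b):
        dct = corpora.Dictionary(toy_texts)
        bow = [dct.doc2bow(t) for t in toy_texts]
        model = models.TfidfModel(bow)
        # The weight of "graph" in the first document changes with the corpus;
        # in toy_b it occurs in every document, so its IDF (and weight) drops to zero.
        print([(dct[i], round(w, 3)) for i, w in model[bow[0]]])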

Tested with Python 3.4.2

Solution 2

This is a great discussion. Thanks for starting this thread. The idea of including document length, from @avip, seems interesting. I will have to experiment and check the results. In the meantime, let me try asking the question a little differently: what are we trying to interpret when querying for TF-IDF relevance scores?

  1. Possibly trying to understand the word relevance at the document level
  2. Possibly trying to understand the word relevance per class
  3. Possibly trying to understand the word relevance overall (in the whole corpus)

     # counts: corpus of 6 documents, 3 features each
     counts = [[3, 0, 1],
               [2, 0, 0],
               [3, 0, 0],
               [4, 0, 0],
               [3, 2, 0],
               [3, 0, 2]]
     import numpy as np
     from sklearn.feature_extraction.text import TfidfTransformer
     transformer = TfidfTransformer(smooth_idf=False)
     tfidf = transformer.fit_transform(counts)
     print(tfidf.toarray())

     # lambdas for basic aggregate statistics (column-wise sum and mean)
     summarizer_default = lambda x: np.sum(x, axis=0)
     summarizer_mean = lambda x: np.mean(x, axis=0)

     print(summarizer_default(tfidf))
     print(summarizer_mean(tfidf))
    

Result:

# Result post computing TF-IDF relevance scores
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

# Result post aggregation (Sum, Mean) 
[[ 4.87420595  0.88089948  1.38675962]]
[[ 0.81236766  0.14681658  0.2311266 ]]

If we observe closely, we realize that feature 1, which occurs in all the documents, is not ignored completely, because the sklearn implementation uses idf = log[ n / df(d, t) ] + 1. The +1 is added so that an important word which just happens to occur in every document is not zeroed out, e.g. the word 'bike' occurring very frequently when classifying a particular document as 'motorcycle' (20_newsgroup dataset).
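
As a quick sanity check on that formula (a sketch that assumes the same `counts` matrix as above), computing idf = ln(n / df) + 1 by hand and L2-normalising the rows reproduces the transformer output:

     import numpy as np

     counts = np.array([[3, 0, 1],
                        [2, 0, 0],
                        [3, 0, 0],
                        [4, 0, 0],
                        [3, 2, 0],
                        [3, 0, 2]], dtype=float)

     n_docs = counts.shape[0]
     df = (counts > 0).sum(axis=0)     # document frequency per feature: [6, 1, 2]
     idf = np.log(n_docs / df) + 1     # a feature present in every document still gets idf = 1
     tfidf_manual = counts * idf
     # TfidfTransformer uses norm='l2' by default, so normalise each row
     tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)
     print(np.round(tfidf_manual, 8))  # matches tfidf.toarray() above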

  1. Now, regarding the first two questions: one is trying to interpret and understand the top common features occurring across the documents. In that case, aggregating in some form over all occurrences of a word in a document does not take anything away, even mathematically. IMO such a query is very useful for exploring the dataset and helping to understand what the dataset is about. The same logic can be applied to vectorizing with hashing as well.

    relevance_score = mean(tf(t, d) * idf(t, d)) = mean( (bias + initial_wt * F(t, d) / max{F(t', d)}) * (log(N / df(d, t)) + 1) )

  2. Question 3 is very important, as it might well contribute to which features get selected for building a predictive model. Using TF-IDF scores independently for feature selection can be misleading at multiple levels. Adopting a more theoretical statistical test such as chi2, coupled with the TF-IDF relevance scores, might be a better approach (see the sketch below); such a test also evaluates the importance of a feature in relation to its respective target class.
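
A minimal sketch of that chi2 idea; the tiny corpus and class labels below are purely hypothetical, and `get_feature_names_out` assumes scikit-learn >= 1.0:

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.feature_selection import SelectKBest, chi2

     docs = ["the bike engine roared", "new bike helmet laws",
             "stock prices fell", "market prices rallied"]
     labels = [0, 0, 1, 1]            # hypothetical target classes

     vec = TfidfVectorizer()
     X = vec.fit_transform(docs)      # TF-IDF features are non-negative, so chi2 is applicable
     selector = SelectKBest(chi2, k=3).fit(X, labels)
     print(vec.get_feature_names_out()[selector.get_support()])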

And of course, combining such an interpretation with the model's learned feature weights would be very helpful in understanding the importance of text-derived features completely.

The problem is a little more complex to cover in detail here, but hopefully the above helps. What do others think?

Reference: https://arxiv.org/abs/1707.05261

Solution 3

There are two contexts in which saliency can be calculated:

  1. saliency in the corpus
  2. saliency in a single document

Saliency in the corpus can simply be calculated by counting the appearances of a particular word in the corpus, or by the inverse of the count of documents in which that word appears (IDF = Inverse Document Frequency), because words that carry specific meaning do not appear everywhere.

Saliency in a document is calculated by TF-IDF, because it is composed of two kinds of information: global information (corpus-based) and local information (document-based). Claiming that "the word with the larger in-document frequency is more important in the current document" is not completely true or false, because it depends on the global saliency of the word. In a particular document you have many words like "it", "is", "am", "are", ... with large frequencies, but these words are not important in any document and you can treat them as stop words!

---- edit ---

The denominator (= len(corpus_tfidf)) is a constant value and can be ignored if you want to deal with the ordinality rather than the cardinality of the measurement. On the other hand, we know that IDF means Inverse Document Frequency, so IDF can be represented by 1/DF. We also know that DF is a corpus-level value and TF is a document-level value. The TF-IDF summation therefore turns document-level TF into corpus-level TF. Indeed, the summation is equal to this formula:

count(word) / count(documents containing word)

This measurement could be called an inverse-scattering value. When the value goes up, the word is concentrated in a smaller subset of documents, and vice versa.
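
A small sketch of that ratio, assuming the tokenised `texts` list from the question is in scope:

    from collections import Counter

    term_count = Counter(w for doc in texts for w in doc)       # corpus-level term frequency
    doc_freq   = Counter(w for doc in texts for w in set(doc))  # number of documents containing the word

    inverse_scattering = {w: term_count[w] / doc_freq[w] for w in term_count}
    print(sorted(inverse_scattering.items(), key=lambda kv: kv[1], reverse=True)[:5])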

I believe that this formula is not so useful.

Comments

  • alvas

    First let's extract the TF-IDF scores per term per document:

    from gensim import corpora, models, similarities
    documents = ["Human machine interface for lab abc computer applications",
                  "A survey of user opinion of computer system response time",
                  "The EPS user interface management system",
                  "System and human system engineering testing of EPS",
                  "Relation of user perceived response time to error measurement",
                  "The generation of random binary unordered trees",
                  "The intersection graph of paths in trees",
                  "Graph minors IV Widths of trees and well quasi ordering",
                  "Graph minors A survey"]
    stoplist = set('for a of the and to in'.split())
    texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    

    Printing it out:

    for doc in corpus_tfidf:
        print(doc)
    

    [out]:

    [(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
    [(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
    [(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
    [(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
    [(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
    [(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
    [(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
    [(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
    [(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
    

    If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the TF-IDF scores across all documents and divide by the number of documents? I.e.

    >>> from collections import Counter
    >>> tfidf_saliency = Counter()
    >>> for doc in corpus_tfidf:
    ...     for word, score in doc:
    ...         tfidf_saliency[word] += score / len(corpus_tfidf)
    ... 
    >>> tfidf_saliency
    Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})
    

    Looking at the output, could we assume that the most "prominent" words in the corpus are:

    >>> dictionary[7]
    u'system'
    >>> dictionary[8]
    u'survey'
    >>> dictionary[26]
    u'graph'
    

    If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?