Topic distribution: How do we see which documents belong to which topic after doing LDA in Python

python nltk lda gensim

25,302

Solution 1

Using the topic probabilities, you can set a threshold and use it as a clustering baseline, but I am sure there are better ways to do clustering than this 'hacky' method.

from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3, \
                               update_every=1, chunksize=10000, passes=1)

# Prints the topics.
for top in lda.print_topics():
    print(top)
print()

# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print(threshold)
print()

cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

print(cluster1)
print(cluster2)
print(cluster3)

[out]:

0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

0.333333333333

['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = []
for doc in lda_corpus:
    # each doc is a list of (topic_id, score) tuples
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores)/len(scores)

The code above sums the topic scores of all documents, then divides the sum by the number of scores.
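
As a rough illustration of why the averaged threshold comes out at 1/#topics, and of how the hand-written cluster1, cluster2, cluster3 lines might be generalised to any number of topics, here is a minimal sketch. It does not call gensim: `doc_topics` and `docs` are made-up stand-ins for `lda_corpus` and `documents`, and it assumes every document lists every topic (as with minimum_probability=0).

```python
# Hypothetical topic distributions for four documents, in the same
# [(topic_id, probability), ...] shape as the entries of lda[mm].
# Assumption: every topic is listed for every document.
doc_topics = [
    [(0, 0.80), (1, 0.10), (2, 0.10)],
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.40), (1, 0.40), (2, 0.20)],
    [(0, 0.10), (1, 0.20), (2, 0.70)],
]
docs = ["doc A", "doc B", "doc C", "doc D"]  # hypothetical document labels
num_topics = 3

# Each document's probabilities sum to 1, so the mean over all scores is
# (number of documents) / (number of scores) = 1 / num_topics.
scores = [score for doc in doc_topics for _, score in doc]
threshold = sum(scores) / len(scores)

# One cluster per topic; topics are looked up by id rather than by position.
clusters = {
    k: [text for dist, text in zip(doc_topics, docs)
        if dict(dist).get(k, 0.0) > threshold]
    for k in range(num_topics)
}

print(threshold)
print(clusters)
```

Note that "doc C" lands in two clusters here: thresholding gives overlapping groups whenever a document exceeds the threshold for more than one topic.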

Solution 2

If you want to use the trick of

cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

from the previous answer by alvas, make sure to set minimum_probability=0 in LdaModel:

gensim.models.ldamodel.LdaModel(corpus,
            num_topics=num_topics, id2word = dictionary,
            passes=2, minimum_probability=0)

Otherwise the topic lists in lda_corpus may not contain every topic, since gensim omits any topic whose probability is lower than minimum_probability. The positional index i[k] then no longer refers to topic k, and it may raise an IndexError.
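
To make the pitfall concrete, here is a small sketch with a made-up topic list for a single document, written the way it might look when one topic has been filtered out. The values are hypothetical, and the dictionary lookup is just one possible workaround, not something the answers above prescribe.

```python
# Hypothetical entry for one document with 3 topics, where topic 1 fell
# below minimum_probability and was omitted, leaving only two tuples.
doc = [(0, 0.30), (2, 0.695)]
num_topics = 3

# Positional indexing: doc[1] is the tuple for topic 2, not topic 1.
position_1 = doc[1]

# doc[2] does not exist at all, so i[2][1] would fail for this document.
try:
    doc[2]
    index_error = False
except IndexError:
    index_error = True

# Looking topics up by id avoids both problems; omitted topics count as 0.
by_id = dict(doc)
probs = [by_id.get(k, 0.0) for k in range(num_topics)]

print(position_1, index_error, probs)
```

With the lookup by id, a comparison such as `probs[k] > threshold` stays meaningful even when minimum_probability is left at a non-zero value.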

An alternative way to group documents into topics is to assign each document to its most probable topic:

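    # keep only the most probable (topic_id, probability) tuple per document
    lda_corpus = [max(prob, key=lambda y: y[1])
                  for prob in lda[mm]]
    # one list of documents per topic; topic_num is the number of topics
    playlists = [[] for i in range(topic_num)]
    for i, x in enumerate(lda_corpus):
        playlists[x[0]].append(documents[i])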

Note that lda[mm] is, roughly speaking, a list of lists, or a 2D matrix. The number of rows is the number of documents and the number of columns is the number of topics. Each matrix element is a tuple such as (3, 0.82), where 3 is the topic index and 0.82 is the probability that the document belongs to that topic. By default, minimum_probability=0.01, and any tuple with a probability below 0.01 is omitted from lda[mm]. If you use the maximum-probability grouping method, you can set it as high as 1/#topics, because a document's most probable topic always has a probability of at least 1/#topics.
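
The maximum-probability grouping can be tried out without training a model. The sketch below uses made-up `doc_topics` and `docs` values in place of lda[mm] and documents, and swaps the pre-sized list of lists for a `defaultdict`, so the number of topics does not need to be known in advance.

```python
from collections import defaultdict

# Hypothetical stand-in for lda[mm]: one list of (topic_id, probability)
# tuples per document. Low-probability topics are omitted, as they would
# be with a non-zero minimum_probability.
doc_topics = [
    [(0, 0.82), (1, 0.18)],
    [(1, 0.95), (2, 0.05)],
    [(0, 0.15), (2, 0.85)],
    [(0, 0.60), (1, 0.40)],
]
docs = ["doc A", "doc B", "doc C", "doc D"]  # hypothetical document labels

# Assign each document to the topic with the highest probability.
groups = defaultdict(list)
for dist, text in zip(doc_topics, docs):
    best_topic, best_prob = max(dist, key=lambda pair: pair[1])
    groups[best_topic].append(text)

print(dict(groups))
```

Unlike thresholding, this puts every document in exactly one group, and it is unaffected by omitted topics because the topic id is read from the tuple itself.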

Solution 3

Each lda_corpus[i] is of the form [(0, p0), (1, p1), ..., (n, pn)], where the 1st term denotes the topic index and the 2nd term denotes the probability of that topic in document i.

Author by

jxn

Updated on August 29, 2020

Comments

jxn over 3 years

I am able to run the LDA code from gensim and got the top 10 topics with their respective keywords.

Now I would like to go a step further and see how accurate the LDA algo is by checking which documents it clusters into each topic. Is this possible in gensim LDA?

Basically I would like to do something like this, but in Python and using gensim.

LDA with topicmodels, how can I see which topics different documents belong to?
jxn over 10 years

this looks like a good solution! Another solution i found was to use the topic distribution to do k-means clustering. as seen in this link stackoverflow.com/questions/6486738/… but i am not sure how to implement it. Would you know how to do it?
alvas over 10 years

i'm trying to re-implement brown (stackoverflow.com/questions/20998832/…) too, but given (topic,prob) tuples, you can try this script from stackoverflow.com/questions/20990538/…
dh762 over 10 years

How could you use more clusters, based on how many topics you have?
alvas over 10 years

that's the scary part: no one knows the best number of topics to set, and no one knows the best number of clusters to extract. I'm no computer scientist, but I'm sure someone has worked out how to determine the optimal number of topics/clusters.
dh762 over 10 years

I've gotten much better performance by removing unique words like in this question
alvas over 10 years

i'm confused??? didn't the code i have up there already remove words that occurred once?
alvas over 10 years

Ahhh, now i see, performance as in speed, not accuracy/precision =)
jxn over 9 years

can you explain these lines of code more specifically? scores = list(chain(*[[score for topic,score in topic] for topic in [doc for doc in lda_corpus]])) and threshold = sum(scores)/len(scores)
jxn over 9 years

And, how did you get the numbers for [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold] at the [0][1] part?
alvas over 9 years

The outer index accesses a topic's entry in lda_corpus, the inner index accesses the topic score. Actually, you should print it out for yourself: try print([i for i in lda_corpus]), then [i[1] for i in lda_corpus], then lda_corpus[0][1].
jxn over 9 years

would you know how a topic score is computed?
alvas over 9 years

Go through the materials from cs.princeton.edu/~blei/topicmodeling.html
jxn about 8 years

Yes, setting by maximum probability is what I thought about too after as well! Thanks for showing the implementation😀
Economist_Ayahuasca almost 8 years

Hey @nos, could you explain what the first part of the code does, in particular the [0][1] > threshold part? What do these numbers represent?
nos almost 8 years

@AndresAzqueta the elements of lda_corpus are of the form [(0, p0), (1, p1), ...], where the 1st number is the topic index and the 2nd number is the corresponding probability of the document belonging to that topic. If there are N topics, then that list contains N tuples. However, if minimum_probability is not 0, then any tuple with a probability lower than minimum_probability is not included in that list.
Economist_Ayahuasca almost 8 years

Hey @nos, thanks very much for the answer. So if I have five topics, the series would be: [0][1] > threshold, [1][1] > threshold, [2][1] > threshold, [3][1] > threshold, [4][1] > threshold? Thanks
drevicko almost 8 years

@alvas I'd recommend using mallet with prior optimisation turned on (the default I think) and a large number of topics. This is effectively the same as using hierarchical (prior) topic models that essentially infer the number of topics (yes, they do exist), as many of the topics found by mallet end up with very few words assigned to them. btw: you can run mallet from gensim.
StatguyUser about 6 years

LDA gives overlapping clusters and not distinct clusters. stackoverflow.com/questions/49380258/…
Bhaskar Dhariyal over 5 years

here cluster implies topics?