How to abstract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python (gensim)?
Solution 1
You can use word2vec to find the terms most similar to the top n topic words extracted by LDA.
Starting from the LDA output:
Create a dictionary of bigrams from the extracted topics (for example, san_francisco).
Then run word2vec to get the most similar words (unigrams, bigrams, etc.).
Word (cosine distance):
los_angeles (0.666175)
golden_gate (0.571522)
oakland (0.557521)
See the "From words to phrases and beyond" section at https://code.google.com/p/word2vec/
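The bigram-dictionary step above could be sketched roughly as follows. This is only one possible reading of that step, using the standard library alone: it counts adjacent word pairs in the tokenized documents where both words appear among a topic's top words. The function name topic_bigrams, the min_count parameter, and the toy documents are assumptions made for illustration; the topic words are taken from the question.

```python
from collections import Counter


def topic_bigrams(docs, topic_words, min_count=1):
    """Count adjacent word pairs whose two words both belong to the topic.

    docs        -- iterable of token lists (hypothetical toy corpus below)
    topic_words -- top unigrams of one LDA topic
    min_count   -- hypothetical threshold; rarer pairs are dropped
    """
    vocab = set(topic_words)
    counts = Counter()
    for tokens in docs:
        # zip(tokens, tokens[1:]) yields every pair of neighbouring tokens
        for first, second in zip(tokens, tokens[1:]):
            if first in vocab and second in vocab:
                counts["_".join((first, second))] += 1
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


# Toy data, invented for this sketch
docs = [
    ["scuba", "diving", "needs", "training"],
    ["water", "vapor", "rises", "after", "scuba", "diving"],
]
topic1 = ["scuba", "water", "vapor", "diving"]

bigrams = topic_bigrams(docs, topic1)
print(bigrams)
```

The resulting underscore-joined keys can then be looked up in a word2vec model that was trained on phrase-merged text, which is where neighbours such as los_angeles would come from.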
Solution 2
Given a dict called docs that maps each document to its list of words, I can extend each list with bigrams (or trigrams, etc.) using nltk.util.ngrams or your own function, like this:
from nltk.util import ngrams
for doc in docs:
    # append the underscore-joined bigrams to the document's unigrams
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]
Then you pass the values of this dict to the LDA model as the corpus (in gensim, after converting them to bag-of-words vectors with a Dictionary). Bigrams joined by underscores are thus treated as single tokens.
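If you would rather not depend on nltk, the "your own function" option might look something like the sketch below. The name add_ngrams and its n and sep parameters are made up for this example, and the sample documents are toy data.

```python
def add_ngrams(tokens, n=2, sep="_"):
    """Return the tokens followed by their n-grams joined with `sep`."""
    # range() is empty when the document has fewer than n tokens
    grams = [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + grams


# Toy documents, invented for this sketch
docs = {
    "doc1": ["carbon", "dioxide", "emissions"],
    "doc2": ["green", "plants"],
}

# Build a new dict so the original token lists are left untouched
docs = {name: add_ngrams(tokens) for name, tokens in docs.items()}
print(docs["doc1"])
```

Calling add_ngrams again with n=3 on the original unigram lists would give trigrams in the same way.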
Thomas N T
Updated on June 18, 2022
LDA original output (unigrams):
topic1 - scuba, water, vapor, diving
topic2 - dioxide, plants, green, carbon
Required output (bigram topics):
topic1 - scuba diving, water vapor
topic2 - green plants, carbon dioxide
Any ideas?