Python Gensim: how to calculate document similarity using the LDA model?

26,329

Solution 1

Don't know if this'll help but, I managed to attain successful results on document matching and similarities when using the actual document as a query.

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda") #result from running online lda (training)

index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

docname = "docs/the_doc.txt"
doc = open(docname, 'r').read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print sims

Your similarity score between all documents residing in the corpus and the document that was used as a query will be the second index of every sim for sims.

Solution 2

Depends what similarity metric you want to use.

Cosine similarity is universally useful & built-in:

sim = gensim.matutils.cossim(vec_lda1, vec_lda2)

Hellinger distance is useful for similarity between probability distributions (such as LDA topics):

import numpy as np
dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())

Solution 3

Provided answers are good, but they aren't very beginner-friendly. I want to start from training the LDA model and calculate cosine similarity.

Training model part:

docs = ["latent Dirichlet allocation (LDA) is a generative statistical model", 
        "each document is a mixture of a small number of topics",
        "each document may be viewed as a mixture of various topics"]

# Convert document to tokens
docs = [doc.split() for doc in docs]

# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Training the model
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

For extracting the probability assigned to each topic for a document, there are generally two ways. I provide here the both:

# Some preprocessing for documents like the training the model
test_doc = ["LDA is an example of a topic model",
            "topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]

# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))

# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))

output:

#Method 1
0.8279631530869963

#Method 2
0.828066885140262

As you can see both of the methods are generally the same, the difference is in the probabilities returned in the 2nd method sometimes doesn't add up to one as discussed here. For large corpus, the possibility vector could be given by passing the whole corpus:

#Method 1
possibility_vector = model.get_document_topics(test_corpus, minimum_probability=0)
#Method 2
possiblity_vector = model[test_corpus]

NOTE: The sum of probability assigned to each topic in a document may become a bit higher than 1 or in some cases a bit less than 1. That is because of the floating-point arithmetic rounding errors.

Share:
26,329
Admin
Author by

Admin

Updated on June 11, 2020

Comments

  • Admin
    Admin almost 4 years

    I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!