How to calculate the sentence similarity using word2vec model of gensim with python


Solution 1

This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurrences / relationships in lots of real textual examples, etc.

The simplest thing you could try -- though I don't know how well this would perform and it would certainly not give you the optimal results -- would be to first remove all "stop" words (words like "the", "an", etc. that don't add much meaning to the sentence) and then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
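
For illustration, a minimal sketch of that summing approach might look like the following (the stop-word list is a toy placeholder and the pretrained vectors file is only an assumption; adjust both to your setup):

import numpy as np
from gensim.models import KeyedVectors

# assumed pretrained vectors; any word2vec-format file works here
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
stop_words = {'the', 'a', 'an', 'to', 'of'}  # toy list; use a real stop-word list in practice

def sum_vector(sentence):
    # drop stop words and out-of-vocabulary tokens, then sum the remaining word vectors
    words = [w for w in sentence.lower().split() if w not in stop_words and w in model]
    return np.sum([model[w] for w in words], axis=0)

v1 = sum_vector('he walked to the store yesterday')
v2 = sum_vector('yesterday he walked to the store')
# compare the sums, e.g. by the norm of the difference (smaller means more similar)
print(np.linalg.norm(v1 - v2))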

So, short answer is, no, there's no easy way to do this (at least not to do it well).

Solution 2

Since you're using gensim, you should probably use its doc2vec implementation. doc2vec is an extension of word2vec to the phrase-, sentence-, and document-level. It's a pretty simple extension, described here:

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it's intuitive, fast, and flexible. What's great is that you can grab the pretrained word embeddings from the official word2vec page and the syn0 layer of gensim's Doc2Vec model is exposed so that you can seed the word embeddings with these high quality vectors!

GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)

I think gensim is definitely the easiest (and so far for me, the best) tool for embedding a sentence in a vector space.
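
As a minimal sketch of the doc2vec route (parameter names such as vector_size have shifted between gensim releases, so treat this as a starting point rather than the exact API for your version):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy import spatial

# each training sentence becomes a TaggedDocument with a unique tag
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(['the first sentence', 'another short sentence'])]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# infer vectors for (possibly unseen) sentences and compare them with cosine similarity
v1 = model.infer_vector('the first sentence'.split())
v2 = model.infer_vector('a completely different sentence'.split())
print(1 - spatial.distance.cosine(v1, v2))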

There exist other sentence-to-vector techniques than the one proposed in Le & Mikolov's paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work has been based on the principle of compositionality - the semantics of a sentence come from:

1. semantics of the words

2. rules for how these words interact and combine into phrases

They've proposed a few such models (getting increasingly more complex) for how to use compositionality to build sentence-level representations.

2011 - unfolding recursive autoencoder (comparatively simple; start here if interested)

2012 - matrix-vector neural network

2013 - neural tensor network

2015 - Tree LSTM

His papers are all available at socher.org. Some of these models are available, but I'd still recommend gensim's doc2vec. For one, the 2011 URAE isn't particularly powerful. In addition, it comes pretrained with weights suited for paraphrasing news-y data. The code he provides does not allow you to retrain the network. You also can't swap in different word vectors, so you're stuck with 2011's pre-word2vec embeddings from Turian. These vectors are certainly not on the level of word2vec's or GloVe's.

Haven't worked with the Tree LSTM yet, but it seems very promising!

tl;dr Yeah, use gensim's doc2vec. But other methods do exist!

Solution 3

If you are using word2vec, you need to calculate the average vector for all words in every sentence/document and use cosine similarity between vectors:

import numpy as np
from scipy import spatial

# vocabulary of the trained model (gensim 3.x API; in gensim 4+ use model.wv.index_to_key)
index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    # average the word vectors of all in-vocabulary words in the sentence
    words = sentence.split()
    feature_vec = np.zeros((num_features,), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

Calculate similarity:

s1_afv = avg_feature_vector('this is a sentence', model=model, num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model, num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

> 0.915479828613

Solution 4

You can use the Word Mover's Distance (WMD) algorithm. Here is an easy description of WMD.

import gensim

# load pretrained word2vec model, here GoogleNews is used
model = gensim.models.KeyedVectors.load_word2vec_format('../GoogleNews-vectors-negative300.bin', binary=True)

# two sample sentences, tokenized into lists of words (wmdistance expects token lists)
s1 = 'the first sentence'.lower().split()
s2 = 'the second text'.lower().split()

# calculate distance between the two sentences using the WMD algorithm
distance = model.wmdistance(s1, s2)

print('distance = %.3f' % distance)

P.S.: if you face an error about importing the pyemd library, you can install it using the following command:

pip install pyemd

Solution 5

Once you compute the sum of the two sets of word vectors, you should take the cosine between the vectors, not the difference. The cosine can be computed by taking the dot product of the two vectors after normalizing them. Thus, the word count is not a factor.
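
In code, assuming v1 and v2 are the two summed sentence vectors (just a numpy sketch, independent of any particular model):

import numpy as np

def cosine_similarity(v1, v2):
    # dot product of the two vectors after normalizing each to unit length,
    # so the number of words that went into each sum drops out
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))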


Comments

  • zhfkt
    zhfkt almost 2 years

According to the Gensim Word2Vec documentation, I can use the word2vec model in the gensim package to calculate the similarity between 2 words.

    e.g.

    trained_model.similarity('woman', 'man') 
    0.73723527
    

However, the word2vec model fails to predict the sentence similarity. I found the LSI model with sentence similarity in gensim, but it doesn't seem that it can be combined with the word2vec model. Each sentence in my corpus is not very long (shorter than 10 words). So, are there any simple ways to achieve the goal?

  • zhfkt
    zhfkt about 10 years
I think you are right. The simplest method is to accumulate all vectors of words in one sentence and find the difference between the sums. By the way, will this simple method be affected by the word count? Because the more words in one sentence, the more vectors get summed up.
  • Michael Aaron Safyan
    Michael Aaron Safyan about 10 years
    @zhfkt, most likely, yes. So you may need to divide by the number of words or some such to try to factor that out. Either way, any heuristic like this will be severely flawed.
  • Καrτhικ
    Καrτhικ almost 9 years
    What is the difference between taking the mean of vectors vs. adding them to create a sentence vector?
  • lechatpito
    lechatpito almost 9 years
    The difference is that the vector size is fixed for all sentences
  • Vladislavs Dovgalecs
    Vladislavs Dovgalecs over 8 years
    It is worth mentioning shortly how the presented algorithm works. You basically add a unique "token" to every utterance and compute the word2vec vectors. At the end you will get the word vectors for each of your word in the corpus (provided you ask for all words, also the unique ones). Each unique "token" in the utterance will represent that utterance. There is some controversy about results presented in the paper but that is another story.
  • theteddyboy
    theteddyboy about 7 years
    Could you please give more explanation on index2word_set and model.index2word? Thank you.
  • Quetzalcoatl
    Quetzalcoatl almost 7 years
Very nice paper. Note: the link to the SIF implementation requires writing the get_word_frequency() method, which can be easily accomplished by using Python's Counter() and returning a dict with keys: unique words w, values: #w / #total doc len (see the sketch after these comments).
  • gented
    gented almost 7 years
    Notice that calculating the "average vector" is as much of an arbitrary choice as not calculating it at all.
  • Simon Hessner
    Simon Hessner about 6 years
    Do you have more information on how to initialize the doc2vec model with pre-trained word2vec values?
  • krinker
    krinker almost 6 years
I used WMD before and it works quite well; however, it would choke on a large corpus. Try SoftCosineSimilarity. Also found in gensim (twitter.com/gensim_py/status/963382840934195200)
  • dcsan
    dcsan over 5 years
    can you provide a bit of pseudocode on how to do this (I'm not using gensim/python)
  • Amartya
    Amartya over 5 years
WMD is not very fast, however, when you want to query a corpus.
  • Wok
    Wok over 5 years
    There is no difference as long as you use the cosine similarity. @lechatpito Nothing to do with vector size. The vectors are summed, not concatenated.
  • Matt L.
    Matt L. almost 5 years
  • Asim
    Asim over 4 years
I am astonished that this isn't the top answer; it works quite well and doesn't have the sequence problem which the averaging method has.
  • iRunner
    iRunner over 4 years
This is the answer I was looking for. Solved my issue. Thanks for the solution.
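
Regarding the get_word_frequency() helper mentioned in the SIF comment above, a rough sketch along the lines described there (the function name comes from that comment, and the exact interface expected by the SIF code may differ):

from collections import Counter

def get_word_frequency(documents):
    # documents: list of token lists; returns {word: count / total token count}
    counts = Counter(w for doc in documents for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}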