Calculate cosine similarity given 2 sentence strings
Solution 1
A simple pure-Python implementation would be:
import math
import re
from collections import Counter
WORD = re.compile(r"\w+")
def get_cosine(vec1, vec2):
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
if not denominator:
return 0.0
else:
return float(numerator) / denominator
def text_to_vector(text):
words = WORD.findall(text)
return Counter(words)
text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."
vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print("Cosine:", cosine)
Prints:
Cosine: 0.861640436855
The cosine formula used here is described here.
This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need to have a reasonably large corpus from which to estimate tfidf weights.
You can also develop it further, by using a more sophisticated way to extract words from a piece of text, stem or lemmatise it, etc.
Solution 2
The short answer is "no, it is not possible to do that in a principled way that works even remotely well". It is an unsolved problem in natural language processing research and also happens to be the subject of my doctoral work. I'll very briefly summarize where we are and point you to a few publications:
Meaning of words
The most important assumption here is that it is possible to obtain a vector that represents each word in the sentence in quesion. This vector is usually chosen to capture the contexts the word can appear in. For example, if we only consider the three contexts "eat", "red" and "fluffy", the word "cat" might be represented as [98, 1, 87], because if you were to read a very very long piece of text (a few billion words is not uncommon by today's standard), the word "cat" would appear very often in the context of "fluffy" and "eat", but not that often in the context of "red". In the same way, "dog" might be represented as [87,2,34] and "umbrella" might be [1,13,0]. Imagening these vectors as points in 3D space, "cat" is clearly closer to "dog" than it is to "umbrella", therefore "cat" also means something more similar to "dog" than to an "umbrella".
This line of work has been investigated since the early 90s (e.g. this work by Greffenstette) and has yielded some surprisingly good results. For example, here is a few random entries in a thesaurus I built recently by having my computer read wikipedia:
theory -> analysis, concept, approach, idea, method
voice -> vocal, tone, sound, melody, singing
james -> william, john, thomas, robert, george, charles
These lists of similar words were obtained entirely without human intervention- you feed text in and come back a few hours later.
The problem with phrases
You might ask why we are not doing the same thing for longer phrases, such as "ginger foxes love fruit". It's because we do not have enough text. In order for us to reliably establish what X is similar to, we need to see many examples of X being used in context. When X is a single word like "voice", this is not too hard. However, as X gets longer, the chances of finding natural occurrences of X get exponentially slower. For comparison, Google has about 1B pages containing the word "fox" and not a single page containing "ginger foxes love fruit", despite the fact that it is a perfectly valid English sentence and we all understand what it means.
Composition
To tackle the problem of data sparsity, we want to perform composition, i.e. to take vectors for words, which are easy to obtain from real text, and to put the together in a way that captures their meaning. The bad news is nobody has been able to do that well so far.
The simplest and most obvious way is to add or multiply the individual word vectors together. This leads to undesirable side effect that "cats chase dogs" and "dogs chase cats" would mean the same to your system. Also, if you are multiplying, you have to be extra careful or every sentences will end up represented by [0,0,0,...,0], which defeats the point.
Further reading
I will not discuss the more sophisticated methods for composition that have been proposed so far. I suggest you read Katrin Erk's "Vector space models of word meaning and phrase meaning: a survey". This is a very good high-level survey to get you started. Unfortunately, is not freely available on the publisher's website, email the author directly to get a copy. In that paper you will find references to many more concrete methods. The more comprehensible ones are by Mitchel and Lapata (2008) and Baroni and Zamparelli (2010).
Edit after comment by @vpekar: The bottom line of this answer is to stress the fact that while naive methods do exist (e.g. addition, multiplication, surface similarity, etc), these are fundamentally flawed and in general one should not expect great performance from them.
Solution 3
I have similar solution but might be useful for pandas
import math
import re
from collections import Counter
import pandas as pd
WORD = re.compile(r"\w+")
def get_cosine(vec1, vec2):
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
if not denominator:
return 0.0
else:
return float(numerator) / denominator
def text_to_vector(text):
words = WORD.findall(text)
return Counter(words)
df=pd.read_csv('/content/drive/article.csv')
df['vector1']=df['headline'].apply(lambda x: text_to_vector(x))
df['vector2']=df['snippet'].apply(lambda x: text_to_vector(x))
df['simscore']=df.apply(lambda x: get_cosine(x['vector1'],x['vector2']),axis=1)
Solution 4
Try this. Download the file 'numberbatch-en-17.06.txt' from https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz and extract it. The function 'get_sentence_vector' uses a simple sum of word vectors. However it can be improved by using weighted sum where weights are proportional to Tf-Idf of each word.
import math
import numpy as np
std_embeddings_index = {}
with open('path/to/numberbatch-en-17.06.txt') as f:
for line in f:
values = line.split(' ')
word = values[0]
embedding = np.asarray(values[1:], dtype='float32')
std_embeddings_index[word] = embedding
def cosineValue(v1,v2):
"compute cosine similarity of v1 to v2: (v1 dot v2)/{||v1||*||v2||)"
sumxx, sumxy, sumyy = 0, 0, 0
for i in range(len(v1)):
x = v1[i]; y = v2[i]
sumxx += x*x
sumyy += y*y
sumxy += x*y
return sumxy/math.sqrt(sumxx*sumyy)
def get_sentence_vector(sentence, std_embeddings_index = std_embeddings_index ):
sent_vector = 0
for word in sentence.lower().split():
if word not in std_embeddings_index :
word_vector = np.array(np.random.uniform(-1.0, 1.0, 300))
std_embeddings_index[word] = word_vector
else:
word_vector = std_embeddings_index[word]
sent_vector = sent_vector + word_vector
return sent_vector
def cosine_sim(sent1, sent2):
return cosineValue(get_sentence_vector(sent1), get_sentence_vector(sent2))
I did run for the given sentences and found the following results
s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."
print cosine_sim(s1, s2) # Should give high cosine similarity
print cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
print cosine_sim(s2, s3) # Shouldn't give high cosine similarity value
0.9851735249068168
0.6570885718962608
0.6589335425458225
Solution 5
Well, if you are aware of word embeddings like Glove/Word2Vec/Numberbatch, your job is half done. If not let me explain how this can be tackled. Convert each sentence into word tokens, and represent each of these tokens as vectors of high dimension (using the pre-trained word embeddings, or you could train them yourself even!). So, now you just don't capture their surface similarity but rather extract the meaning of each word which comprise the sentence as a whole. After this calculate their cosine similarity and you are set.
Related videos on Youtube
Comments
-
alvas almost 2 years
From Python: tf-idf-cosine: to find document similarity , it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are that any ways to calculate cosine similarity between 2 strings?
s1 = "This is a foo bar sentence ." s2 = "This sentence is similar to a foo bar sentence ." s3 = "What is this string ? Totally not related to the other two lines ." cosine_sim(s1, s2) # Should give high cosine similarity cosine_sim(s1, s3) # Shouldn't give high cosine similarity value cosine_sim(s2, s3) # Shouldn't give high cosine similarity value
-
static_rtti almost 9 yearsI don't have the answer, but something like word2vec (code.google.com/p/word2vec) would probably be a good start if you want meaningful results.
-
mithunpaul about 6 years@static_rtti word2vec has got nothing to do with cosine similarity. Thats about embeddings. Here he is giving the two strings from which he wants to calculate cosine similarity.
-
mie.ppa over 4 yearsIf anyone is looking for semantic similarity, gensim can be helpful.
-
-
mbatchkarov over 11 yearsHow about "Felines feed on mice" and "Rodents are often eaten by cats"? You code incorrectly returns 0.
-
vpekar over 11 yearsSurely, an SO question is not the place to definitively solve the problem of modelling semantic similarity of sentences. The question is about measuring (surface) similarity between two bits of text, that's what the code does.
-
vpekar over 11 yearsThe code returns 0, correctly, because it measures surface similarity of two texts, it does not measure meaning as such.
-
mbatchkarov over 11 yearsI agree with your first point and disagree with the second. SO is not for long and theoretical scientific discussions, which is why I refrained from talking about the technical side. While your answer is to the point and correct, the question is clearly not about surface similarity. Please have another look at the example sentences in the question.
-
alvas over 11 yearsI think @vpekar's approach is more like a basic lesk where the surface words are considered as without semantic knowledge.
-
vpekar over 11 years@2er0 Did you mean to ask about how to use external semantic knowledge when measuring similarity between two sentences?
-
alvas over 11 yearsnot really but just a generic question on whether
cosine similarity
can be achieved without external library. But it's good to note @mbatchkarov's advice on more realistic setup for cosine similarity. -
vpekar over 11 yearsYou asked specifically about cosine. I answered specifically that question. The cosine is implemented as "realistically" as you can possibly get. If you mean to say, "how to do better than cosine to measure similarity", then it's a different question.
-
JesseBuesking about 11 yearsOut of curiosity, what approach did you use to build the thesaurus?
-
mbatchkarov about 11 yearsIt's a distributional thesaurus, built with Byblo. In this particular instantiation each token has as features the other tokens that occurs in window of 5 words around it in all of Wikipedia, and similarity is calculated based on these features. We've built other thesauri where the features are the other words the target word has grammatical relations with. This generally works better, but requires at least a partial parse of the corpus, which takes a long time.
-
user1839641 about 10 years@vpekar Does calculating cosine similarity require that both vectors' dimension should be the same? Can you really calculate the cosine of (0,1) and (0,1,0)? Also, this similarity score doesn't catch the differences of word sequence. If text1='This is a foo bar sentence .' and text2 = 'is This a foo bar sentence .', the cosine score is 1.
-
vpekar over 8 yearsThe cosine similarity assumes that if a target word did not co-occur with a context word, then the corresponding value in the feature vector is 0. So the implementation above does not compare vectors of different dimensions. Re. 2nd question - yes, the cosine ignores the sequence of words in the sentence.
-
Oleg Afanasyev about 6 yearsWhy cannot we use RNN to handle composition? This way we'd take into account words order as well as their individual meaning.
-
mbatchkarov about 6 yearsSure, you can. This answer is now over 5 years old. At the time RNNs were just starting to get big (again) and weren't really on my radar
-
user1531248 over 5 yearsWhat to do after getting each word embedding for sentence 1 (6 * 100) and sentence 2 (4*100) by assuming 6 word in first sentence & 100 is the embedding size and 4 word in second sentence with 100D is embedding. And can i do sum of word vector or guide me if you have any?
-
user1531248 over 5 years@OlegAfanasyev - You mean to say AutoEncoder with CNN or RNNs? I am going to implement, could you provide any guidance if anything is required from as usual?
-
desertnaut over 4 yearsThe last 2 paper links are now dead
-
Speeeddy about 3 yearsGoogle has about 1B pages containing the word "fox" and not a single page containing "ginger foxes love fruit" -- now it does :D