Matching words and vectors in gensim Word2Vec model

Solution 1

So I found an easy way to do this, where nmodel is the name of your model.

import numpy as np

# pair each word with its vector; wrap in list() so the pairs can be reused
# (zip() returns a one-shot iterator in Python 3)
zipped = list(zip(nmodel.wv.index2word, nmodel.wv.syn0))

# the resulting list contains (word, wordvector) tuples. We can extract the entry for any
# `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension;
# vectors are numpy arrays, so compare them with np.array_equal rather than ==:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if np.array_equal(i[1], vector)]

This is based on the gensim source code. For older versions of gensim, you might need to drop the `wv` after the model.
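
For repeated lookups it can be cheaper to build the mapping once as a dictionary instead of re-scanning the zipped list every time. A minimal sketch, assuming `nmodel` is the same trained model as above and that the word you query is actually in the vocabulary:

# word -> vector lookup, built once
word_to_vec = dict(zip(nmodel.wv.index2word, nmodel.wv.syn0))

# numpy arrays are not hashable, so key the reverse mapping by the raw bytes of each vector
vec_to_word = {vec.tobytes(): word for word, vec in word_to_vec.items()}

vector = word_to_vec["hello"]          # word -> vector
word = vec_to_word[vector.tobytes()]   # vector -> word (exact match on an unmodified vector only)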

Solution 2

I had been searching for a long time for the mapping between the syn0 matrix and the vocabulary... here is the answer: use model.index2word, which is simply the list of words in the right order!

This is not in the official documentation (why?), but it can be found directly inside the source code: https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441
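
In other words, row i of syn0 holds the vector for index2word[i]. A quick sketch of that alignment (assuming an already trained model called model; on older gensim versions these attributes live directly on the model, on newer ones under model.wv):

# print the first few words next to the first components of their vectors
for i, word in enumerate(model.index2word[:5]):
    print(word, model.syn0[i][:3])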

Solution 3

If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.

If you need to recover a word from a vector you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:

import gensim

documents = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

model = gensim.models.Word2Vec(documents, min_count=1)
print(model.similar_by_vector(model["survey"], topn=1))

which outputs:

[('survey', 1.0000001192092896)]

where the number represents the similarity.

However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.
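
For example, one way to avoid the reverse mapping entirely is to cluster the rows of the embedding matrix directly and keep the row order, so every cluster label can be mapped back to a word by its index. A minimal sketch, assuming scikit-learn is installed and `model` is the Word2Vec model trained above (on older gensim versions you can drop the `wv`):

from sklearn.cluster import KMeans

words = model.wv.index2word          # row i of the matrix below belongs to words[i]
vectors = model.wv.syn0

kmeans = KMeans(n_clusters=3, random_state=0).fit(vectors)

# group words by cluster label; no vector -> word lookup is ever needed
clusters = {}
for word, label in zip(words, kmeans.labels_):
    clusters.setdefault(label, []).append(word)

for label, members in clusters.items():
    print(label, members)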


Comments

  • patrick about 2 years

    I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.

    As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the embedding vector [x, y, z], I would like to find out which actual word it represents. I can get the words/vocab items by calling model.vocab and the word vectors through model.syn0, but I could not find a place where these are explicitly matched.

    This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!

    Problem:

    Match words to the embedding vectors created by Word2Vec() -- how do I do it?

    My approach:

    After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0. Thus

    for i in range(0, newmod.syn0.shape[0]): # iterate over all words in the model
        print(i)
        word = [k for k in newmod.vocab if newmod.vocab[k].index == i] # get the word out of the internal dictionary by its index
        wordvector = newmod.syn0[i] # get the vector with the corresponding index
        print(wordvector == newmod[word]) # testing: compare with looking the word up in the model -- this prints True for every component
    
    • Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?

    • Does this even get me correct results?

    *My code to create the word vectors:

    model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
    model.build_vocab(sentencefeeder(folderlist)) # sentencefeeder puts out sentences as lists of strings
    model.save("newmodel")

    I found this question which is similar but has not really been answered.