Doc2Vec Get most similar documents
You need to use infer_vector to get a document vector for the new text. This does not alter the underlying model.
Here is how you do it:
tokens = "a new sentence to match".split()
new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector]) #gives you top 10 document tags and their cosine similarity
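For intuition, ranking documents against a raw vector amounts to computing the cosine similarity between the inferred vector and every stored document vector, then sorting. The following is a minimal NumPy sketch of that idea, not gensim's actual implementation; the function name rank_by_cosine and the toy vectors are made up for illustration.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, topn=10):
    """Return (doc_index, cosine_similarity) pairs, most similar first.

    Hypothetical helper for illustration only; gensim does this internally.
    """
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # Cosine similarity of the query against every document row at once
    sims = doc_vecs @ query_vec / (doc_norms * query_norm)
    # Indices of the topn largest similarities, in descending order
    order = np.argsort(-sims)[:topn]
    return [(int(i), float(sims[i])) for i in order]

# Toy data: three 2-dimensional "document vectors" and one "inferred" query vector
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([2.0, 0.1])
ranked = rank_by_cosine(query_vec, doc_vecs, topn=2)
print(ranked)
```

The indices in the result play the role of the document tags returned by most_similar.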
Edit:
Here is an example showing that the underlying model does not change after infer_vector is called.
import numpy as np
words = "king queen man".split()
len_before = len(model.docvecs)  # number of docs
# word vectors for king, queen, man (copied, so we compare against a snapshot rather than a view of the model's own array)
w_vec0 = np.copy(model[words[0]])
w_vec1 = np.copy(model[words[1]])
w_vec2 = np.copy(model[words[2]])
new_vec = model.infer_vector(words)
len_after = len(model.docvecs)
print(np.array_equal(model[words[0]], w_vec0))  # True
print(np.array_equal(model[words[1]], w_vec1))  # True
print(np.array_equal(model[words[2]], w_vec2))  # True
print(len_before == len_after)  # True
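Note that an unchanged model does not imply identical output on every call: inference itself can involve random choices. The toy sketch below illustrates this under stated assumptions. It is not gensim's code; the function infer_doc_vector, the frozen weight matrix W, and the use of negative sampling as the source of randomness are all illustrative. Only the new document vector is trained, while the weights are read but never written.

```python
import numpy as np

def infer_doc_vector(W, word_ids, rng, epochs=5, lr=0.1, negatives=3):
    """Fit a fresh document vector against frozen weights W (toy example)."""
    d = np.zeros(W.shape[1])  # the only thing that gets trained
    for _ in range(epochs):
        for w in word_ids:
            # Randomly drawn negative words: the source of non-determinism here
            neg = rng.integers(0, W.shape[0], size=negatives)
            pairs = [(w, 1.0)] + [(int(n), 0.0) for n in neg]
            for j, label in pairs:
                pred = 1.0 / (1.0 + np.exp(-(W[j] @ d)))
                # Update d only; W is never modified
                d = d + lr * (label - pred) * W[j]
    return d

W = np.random.default_rng(0).normal(size=(20, 8))  # stand-in for trained weights
W_before = W.copy()
doc = [1, 5, 7]  # toy word ids

v1 = infer_doc_vector(W, doc, np.random.default_rng(1))
v2 = infer_doc_vector(W, doc, np.random.default_rng(2))
v3 = infer_doc_vector(W, doc, np.random.default_rng(1))

print(np.array_equal(W, W_before))  # weights untouched by inference
print(np.array_equal(v1, v2))       # different random draws
print(np.array_equal(v1, v3))       # same random draws
```

In this sketch, differing results between calls come entirely from the random generator's state, not from any change to W.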
Clock Slave
Updated on July 09, 2022
Comments
Clock Slave almost 2 years
I am trying to build a document retrieval model that returns the most relevant documents, ordered by their relevance to a query or search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is a pandas DataFrame which has each document stored as a string on each line. This is the code I have so far:
import multiprocessing
import gensim, re
import pandas as pd
from gensim.models.doc2vec import LabeledSentence

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences = []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line, [item_no]))

# MODEL PARAMETERS
dm = 1  # 1 for distributed memory (default); 0 for dbow
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(
    documents = sentences,
    dm = dm,
    alpha = alpha,  # initial learning rate
    seed = seed,
    min_count = min_count,  # ignore words with freq less than min_count
    max_vocab_size = None,  #
    window = context_window,  # the number of words before and after to be used as context
    size = size,  # is the dimensionality of the feature vector
    sample = 1e-4,  # ?
    negative = 5,  # ?
    workers = cores,  # number of cores
    iter = max_iter  # number of iterations (epochs) over the corpus
)

# QUERY BASED DOC RANKING ??
The part where I am struggling is finding the documents that are most similar/relevant to the query. I used infer_vector, but then realised that it considers the query as a document, updates the model and returns the results. I tried using the most_similar and most_similar_cosmul methods, but I get words along with a similarity score (I guess) in return. What I want is this: when I enter a search string (a query), I should get the documents (ids) that are most relevant, along with a similarity score (cosine etc.). How do I get this part done?
Clock Slave about 7 years
Are you sure that it doesn't update the model? The infer_vector method takes parameters like alpha and min_alpha. I'm assuming they are learning rates. There's not much given in the documentation, so I am not really sure if they are learning rates or some other parameters. Also, I came to think that it was updating the model because every time I passed the same sentence to infer_vector and then to most_similar, I got different results.
Erock about 7 years
infer_vector, like training, has non-deterministic elements. You will get different vectors on each call. There are a number of discussions about this on Gensim's mailing list and in their issue log on GitHub. Here is a good example: github.com/RaRe-Technologies/gensim/issues/447. Also, you can test whether the model changes. See my edit.
Antoine over 6 years
It's clearly stated in the doc2vec paper that at inference time, all the parameters of the model are fixed. So the model definitely doesn't get updated.
user2849678 about 5 years
@ClockSlave Yes, infer_vector is changing the model. I am reloading the model after infer_vector, and the output is deterministic. Very useful post!