How to use TaggedDocument in gensim?
Solution 1
The input for a Doc2Vec model should be a list of TaggedDocument objects, each built as TaggedDocument(['list', 'of', 'words'], ['TAG_001']). A good practice is to use the index of each sentence as its tag. For example, to train a Doc2Vec model on two sentences (i.e. documents or paragraphs):
import gensim
from gensim.models.doc2vec import TaggedDocument

s1 = 'the quick brown fox jumps over the lazy dog'
s1_tag = '001'
s2 = 'i want to burn a zero-day'
s2_tag = '002'
docs = []
docs.append(TaggedDocument(words=s1.split(), tags=[s1_tag]))
docs.append(TaggedDocument(words=s2.split(), tags=[s2_tag]))
# min_count=1 keeps every word; with only two short sentences, min_count=5 would leave the vocabulary empty
model = gensim.models.Doc2Vec(vector_size=300, window=5, min_count=1, workers=4, epochs=20)
model.build_vocab(docs)
print('Start training process...')
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
# save model
model_path = 'doc2vec.model'  # example path; choose your own
model.save(model_path)
Solution 2
After experimenting a bit, I found this in the gensim source on GitHub:
class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.
    Replaces "sentence as a list of words" from Word2Vec.
    """
So I changed how I use TaggedDocument: instead of one object for the whole corpus, I create one TaggedDocument instance per document. The important thing is that the tags must be passed as a list.
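for doc in CogList:
    # read each file as UTF-8 text (Python 3); avoid shadowing the built-in str
    with open(CogPath + doc, 'r', encoding='utf-8') as f:
        text = f.read()
    word_list = text.split()
    T = TaggedDocument(word_list, [doc])  # tags must be a list
    docs.append(T)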
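For the layout described in the question (several directories of .txt files), the loop above could be wrapped in a small helper. This is only a sketch under a few assumptions: read_tagged_docs is a hypothetical name, tokenization is a plain whitespace split, and the namedtuple fallback is a stand-in with the same words/tags fields as gensim's class, used only so the snippet also runs where gensim is not installed.

```python
from pathlib import Path

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, only so the sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def read_tagged_docs(directory):
    """Return one TaggedDocument per .txt file, tagged with its file name.

    Hypothetical helper; tokenizes on whitespace for simplicity.
    """
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        words = path.read_text(encoding="utf-8").split()
        docs.append(TaggedDocument(words, [path.name]))  # tags must be a list
    return docs
```

With the question's paths, read_tagged_docs("./FixedCog/") + read_tagged_docs("./FixedNotCog/") would give a single list to pass to Doc2Vec. If the same file name can occur in more than one directory, the tags would collide, so a directory prefix in the tag may be needed.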
Solution 3
You can use gensim's common_texts as an example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
This will use common_texts and TaggedDocument to create the document representation expected by the Doc2Vec algorithm.
Farhood
Updated on July 09, 2022
Comments
-
Farhood almost 2 years
I have two directories of text files that I want to read and label, but I don't know how to do this via TaggedDocument. I thought it would work as TaggedDocument([Strings],[Labels]), but apparently it doesn't. This is my code:
from gensim import models
from gensim.models.doc2vec import TaggedDocument
import utilities as util
import os
from sklearn import svm
from nltk.tokenize import sent_tokenize

CogPath = "./FixedCog/"
NotCogPath = "./FixedNotCog/"
SamplePath ="./Sample/"
docs = []
tags = []
CogList = [p for p in os.listdir(CogPath) if p.endswith('.txt')]
NotCogList = [p for p in os.listdir(NotCogPath) if p.endswith('.txt')]
SampleList = [p for p in os.listdir(SamplePath) if p.endswith('.txt')]
for doc in CogList:
    str = open(CogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    print docs
    tags.append(doc)
    print "###########"
    print tags
    print "!!!!!!!!!!!"
for doc in NotCogList:
    str = open(NotCogPath+doc,'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)
for doc in SampleList:
    str = open(SamplePath + doc, 'r').read().decode("utf-8")
    docs.append(str)
    tags.append(doc)

T = TaggedDocument(docs,tags)
model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
and this is the error I get:
Traceback (most recent call last):
  File "/home/farhood/PycharmProjects/word2vec_prj/doc2vec.py", line 34, in <module>
    model = models.Doc2Vec(T,alpha=.025, min_alpha=.025, min_count=1,size=50)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 635, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 544, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/farhood/anaconda2/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 674, in scan_vocab
    if isinstance(document.words, string_types):
AttributeError: 'list' object has no attribute 'words'
-
gojomo almost 7 years
Separate from your main question: having the ending min_alpha be the same value as the starting alpha means your training isn't doing proper stochastic gradient descent. Also, it's rare for min_count=1 to be helpful in Word2Vec/Doc2Vec training; keeping such rare words just tends to make training take longer and interfere with the quality of the remaining word-vecs/doc-vecs.
-
Farhood almost 7 years
About min_alpha: I copied it from sample code, which was followed by this loop:
for epoch in range(10):
    model.train(docs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
About min_count: my data set is very limited, and some words are not very frequent but carry a lot of the meaning. I have also filtered out most stop words and frequent everyday words.
-
gojomo almost 7 years
That's a bad sample to follow. If you pass your corpus in when creating the Doc2Vec instance, it will automatically do all its training passes and automatically manage the learning rate from alpha to min_alpha, so you shouldn't call train() yourself. (And if you do, as you've shown without any other specifics, the latest gensim versions will throw an error because it's such a common mistake.) It is a rare, expert thing to call train() yourself or to tamper with the default alpha/min_alpha.
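To see why equal alpha and min_alpha values matter, here is a rough sketch of a decaying learning rate. It assumes a plain linear schedule from alpha down to min_alpha over the training run; linear_alpha is a hypothetical helper for illustration, not a gensim function, and the end value 0.0001 is only an example.

```python
def linear_alpha(alpha, min_alpha, progress):
    """Learning rate at `progress` (0.0 = start, 1.0 = end of training)."""
    return alpha - (alpha - min_alpha) * progress


# Distinct start and end values: the rate shrinks as training proceeds.
decaying = [linear_alpha(0.025, 0.0001, p) for p in (0.0, 0.5, 1.0)]

# Equal values, as in the question: the rate never changes.
flat = [linear_alpha(0.025, 0.025, p) for p in (0.0, 0.5, 1.0)]

print(decaying)
print(flat)
```

With alpha == min_alpha the difference term is zero, so every step uses the same rate and there is no decay at all.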
-
gojomo almost 7 years
Yes: Doc2Vec is expecting the corpus to be an iterable collection, where each individual item (document) is shaped like a TaggedDocument. (That is, it has a words list and a tags list.)
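The AttributeError in the question follows from that shape requirement. Below is a small stand-alone check. It uses a namedtuple stand-in declared the same way as the class quoted in Solution 2, so gensim itself is not needed; the texts and file names are made up for illustration. Iterating over a single TaggedDocument yields its two fields, which are plain lists without a words attribute.

```python
from collections import namedtuple

# Stand-in mirroring gensim's definition: namedtuple('TaggedDocument', 'words tags')
TaggedDocument = namedtuple("TaggedDocument", "words tags")

texts = ["first document text", "second document text"]
names = ["a.txt", "b.txt"]

# What the question did: one TaggedDocument wrapping the whole corpus.
wrong = TaggedDocument(texts, names)
# Iterating over it yields its two fields, which are plain lists.
items = list(wrong)
has_words_wrong = [hasattr(item, "words") for item in items]

# What Doc2Vec expects: an iterable with one TaggedDocument per document.
right = [TaggedDocument(text.split(), [name]) for text, name in zip(texts, names)]
has_words_right = [hasattr(doc, "words") for doc in right]

print(has_words_wrong)  # [False, False]
print(has_words_right)  # [True, True]
```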