Python: gensim: RuntimeError: you must first build vocabulary before training the model

python gensim word2vec

29,183

Solution 1

Default min_count in gensim's Word2Vec is set to 5. If there is no word in your vocab with frequency greater than 4, your vocab will be empty and hence the error. Try

voc_vec = word2vec.Word2Vec(vocab, min_count=1)

Solution 2

Input to the gensim's Word2Vec can be a list of sentences or list of words or list of list of sentences.

E.g.

1. sentences = ['I love ice-cream', 'he loves ice-cream', 'you love ice cream']
2. words = ['i','love','ice - cream', 'like', 'ice-cream']
3. sentences = [['i love ice-cream'], ['he loves ice-cream'], ['you love ice cream']]

build the vocab before training

model.build_vocab(sentences, update=False)

just check out the link for detailed info

29,183

user56591

Updated on March 07, 2020

Comments

user56591 about 4 years

I know that this question has been asked already, but I was still not able to find a solution for it.

I would like to use gensim's word2vec on a custom data set, but now I'm still figuring out in what format the dataset has to be. I had a look at this post where the input is basically a list of lists (one big list containing other lists that are tokenized sentences from the NLTK Brown corpus). So I thought that this is the input format I have to use for the command word2vec.Word2Vec(). However, it won't work with my little test set and I don't understand why.

What I have tried:

This worked:

from gensim.models import word2vec
from nltk.corpus import brown
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

brown_vecs = word2vec.Word2Vec(brown.sents())

This didn't work:

sentences = [ "the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
vocab = [s.encode('utf-8').split() for s in sentences]
voc_vec = word2vec.Word2Vec(vocab)

I don't understand why it doesn't work with the "mock" data, even though it has the same data structure as the sentences from the Brown corpus:

vocab:

[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]

brown.sents(): (the beginning of it)

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Can anyone please tell me what I'm doing wrong?

Sergey Orshanskiy about 7 years

Also if your data comes from an iterator, makes sure it has not already been consumed.
krinker almost 6 years

+1 From documentation: min_count: (default 5) The minimum count of words to consider when training the model; words with an occurrence less than this count will be ignored