How to load sentences into Python gensim?
Solution 1
Word2Vec expects a sequence of tokenized sentences; you can also stream the data from disk instead of holding it all in memory. Make sure each sentence is utf-8, and split it into a list of words:
sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]

# Python 2: encode to utf-8 byte strings. On Python 3, str is already
# unicode, so use s.split() instead of s.encode('utf-8').split().
word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences],
                  size=100, window=5, min_count=5, workers=4)
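As a side note, with only these two toy sentences the trained vocabulary comes out empty, because every word occurs once or twice and Word2Vec discards tokens rarer than min_count (default 5). A minimal pure-Python sketch of that pruning step — the counting logic here is illustrative, not gensim's actual implementation:

```python
from collections import Counter

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]

# Tokenize the same way the answer does: one list of words per sentence.
tokenized = [s.lower().split() for s in sentences]

# Word2Vec builds its vocabulary by counting tokens and dropping
# anything rarer than min_count (default 5).
counts = Counter(word for sent in tokenized for word in sent)
min_count = 5
vocab = {w for w, c in counts.items() if c >= min_count}

print(vocab)  # set() -- no word reaches min_count, so the vocab is empty
```

This is exactly why the questioner's x.vocab came back as {}; lowering min_count (e.g. min_count=1) on a tiny corpus fixes it.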
Solution 2
Like alKid pointed out, make it utf-8.

Two additional things you might have to worry about:
- Input is too large and you're loading it from a file.
- Removing stop words from the sentences.
Instead of loading a big list into memory, you can do something like:
import nltk, gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r'):
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll
And then,
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
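For reference, here is a Python 3 adaptation of the same streaming idea (the original uses Python 2's unicode()). The corpus contents and stopword set below are made up for illustration; the key property is that the object can be iterated more than once, because Word2Vec makes several passes over the corpus (one vocabulary scan plus the training epochs):

```python
import os
import tempfile

class FileToSent(object):
    """Stream one token list per line; the file is re-opened on every pass."""
    def __init__(self, filename, stop=frozenset()):
        self.filename = filename
        self.stop = stop

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf-8') as fh:
            for line in fh:
                yield [w for w in line.lower().split() if w not in self.stop]

# Write a tiny throwaway corpus so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write("the quick brown fox jumps over the lazy dogs\n")
    f.write("then a cop quizzed mick jagger's ex-wives briefly\n")
    path = f.name

sentences = FileToSent(path, stop={'the', 'a', 'over'})

# Unlike a plain generator, this iterable can be consumed repeatedly.
first_pass = list(sentences)
second_pass = list(sentences)
print(first_pass == second_pass)   # True
print(first_pass[0])               # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dogs']

os.remove(path)
```

If you passed a one-shot generator instead, the vocabulary scan would exhaust it and training would see an empty corpus.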
Comments
-
john mangual over 1 year
I am trying to use the word2vec module from the gensim natural language processing library in Python. The docs say to initialize the model:

from gensim.models import word2vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format does gensim expect for the input sentences? I have raw text like

"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."

etc. What additional processing do I need to do to feed it into word2vec?

UPDATE: Here is what I have tried. When it loads the sentences, I get nothing.

>>> sentences = ['the quick brown fox jumps over the lazy dogs', "Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split() for s in sentences])
>>> x.vocab
{}
-
alko over 10 years
Actually, a sentence has to be a list of words, not a string, i.e. s.encode('utf-8').split()
-
aIKid over 10 years
Whoops, sorry. Updated. Thanks!
-
john mangual over 10 years
RuntimeError: you must first build vocabulary before training the model
-
aIKid over 10 years
Which line? Is that from my code? If it isn't, then it's a separate question.
-
john mangual over 10 years
@alKid the second line. I imported gensim and ran your script verbatim.
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 162, in __init__
and File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 244, in train
-
john mangual over 10 years
@alkid Or if you look at my update, I ran Word2Vec.build_vocab directly but vocab was still empty afterwards.
-
Radim over 10 years
Enable logging and observe what it says. Therein lies your answer. Spoiler: min_count=5.
-
Radim over 10 years
@alKid good answer, but it's a sequence (an iterable) of sentences, not necessarily a list. This makes a big difference when sentences is larger than RAM, i.e. streamed from disk.
-
aIKid over 10 years
@Radim Thanks a lot for observing that.
-
Mona Jalal over 7 years
@alkid Can you show how to input data from disk? Also, what other datasets besides text8 are available?
-
Lcat over 6 years
Adding utf-8 encoding gives me TypeError: can't concat bytes to string