How to load sentences into Python gensim?

11,607

Solution 1

Word2Vec expects an iterable of sentences, where each sentence is a list of utf-8 words. You can also stream the data from disk instead of holding it all in memory.

Make sure it's utf-8, and split each sentence into a list of words:

from gensim.models import word2vec

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
# In Python 3 the strings are already unicode, so split them directly;
# calling .encode('utf-8') first would produce bytes tokens.
# Note: with a corpus this small, min_count=5 drops every word.
word2vec.Word2Vec([s.split() for s in sentences], size=100, window=5, min_count=5, workers=4)

Solution 2

As alKid pointed out, make it utf-8.

There are two additional things you might have to worry about:

  1. Input is too large and you're loading it from a file.
  2. Removing stop words from the sentences.

Instead of loading a big list into memory, you can do something like this:

import nltk, gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        # Decode as utf-8, lowercase, split, and drop stop words,
        # yielding one sentence (list of words) per line of the file.
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                yield [w for w in line.lower().split() if w not in self.stop]

And then,

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
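Why a class with `__iter__` rather than a plain generator or list? Word2Vec scans the corpus more than once (once to build the vocabulary, then again for training), so the iterable must be restartable. A minimal sketch of that property, with the stop-word filtering left out and a temporary file standing in for `sentence_file.txt`:

```python
import os
import tempfile

# A restartable iterable: each call to __iter__ reopens the file,
# unlike a plain generator, which is exhausted after one pass.
class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, encoding='utf-8') as f:
            for line in f:
                yield line.lower().split()

# Write a tiny stand-in corpus to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write("the quick brown fox\njumps over the lazy dogs\n")
    path = f.name

corpus = FileToSent(path)
first_pass = list(corpus)
second_pass = list(corpus)       # works: __iter__ restarts from the top
print(first_pass == second_pass)  # True
os.unlink(path)
```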
Author

john mangual

Data Scientist @ Explorer Media.

Updated on August 01, 2022

Comments

  • john mangual
    john mangual over 1 year

    I am trying to use the word2vec module from gensim natural language processing library in Python.

    The docs say to initialize the model:

    from gensim.models import word2vec
    model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
    

    What format does gensim expect for the input sentences? I have raw text

    "the quick brown fox jumps over the lazy dogs"
    "Then a cop quizzed Mick Jagger's ex-wives briefly."
    etc.
    

    What additional processing do I need to do before passing the sentences into word2vec?


    UPDATE: Here is what I have tried. When it loads the sentences, I get nothing.

    >>> sentences = ['the quick brown fox jumps over the lazy dogs',
                 "Then a cop quizzed Mick Jagger's ex-wives briefly."]
    >>> x = word2vec.Word2Vec()
    >>> x.build_vocab([s.encode('utf-8').split() for s in sentences])
    >>> x.vocab
    {}
    
  • alko
    alko over 10 years
    actually, sentence has to be a list of words, not a string, i.e. s.encode('utf-8').split()
  • aIKid
    aIKid over 10 years
    Whoops sorry. Updated. Thanks
  • john mangual
    john mangual over 10 years
    RuntimeError: you must first build vocabulary before training the model
  • aIKid
    aIKid over 10 years
    Which line? Is that from my code? If it isn't, then it's a separate question.
  • john mangual
    john mangual over 10 years
    @alKid the second line. I imported gensim and ran your script verbatim. File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 162, in __init__ and "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 244, in train
  • john mangual
    john mangual over 10 years
    @alkid Or if you look at my update, I ran Word2Vec.build_vocab directly but vocab was still empty afterwards.
  • Radim
    Radim over 10 years
    Enable logging and observe what it says. Therein lies your answer. Spoiler: min_count=5.
  • Radim
    Radim over 10 years
    @alKid good answer, but it's a sequence (an iterable) of sentences, not necessarily a list. This makes a big difference when sentences is larger than RAM, i.e. streamed from disk.
  • aIKid
    aIKid over 10 years
    @Radim Thanks a lot for observing that.
  • Mona Jalal
    Mona Jalal over 7 years
    @alkid can you show how to input data from disk? also what other datasets besides text8 are available?
  • Lcat
    Lcat over 6 years
    adding utf8 encoding gives me TypeError: can't concat bytes to string