Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Solution 1

You are loading the file with the wrong method. Use load() instead of load_word2vec_format(). The latter reads the word2vec format written by the original C tool, with binary=True for its binary variant. Your model was trained in Python and, presumably, saved with gensim's own save(), so it is not in that format. The following should work:

models = gensim.models.Word2Vec.load('300features_40minwords_10context.txt')
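
The 0x80 in the error message is consistent with this explanation. Assuming the model file was written with save() and is therefore a Python pickle, its first byte is the pickle protocol marker 0x80, which is not a valid first byte of UTF-8 text. The sketch below uses only the standard library; the dictionary is a hypothetical stand-in for a real model object, not gensim's actual on-disk content.

```python
import pickle

# Hypothetical stand-in for a saved model; any object pickled with
# protocol 2 or higher begins with the PROTO opcode, byte 0x80.
pickled = pickle.dumps({"stand-in": "for a saved model"}, protocol=2)
first_byte = pickled[0]

# Decoding those bytes as UTF-8 text, which is what a word2vec-format
# reader does with the header line, fails on the very first byte.
try:
    pickled.decode("utf8")
    message = ""
except UnicodeDecodeError as err:
    message = str(err)

print(hex(first_byte))
print(message)
```

The printed message names byte 0x80 at position 0, matching the traceback in the question.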

Solution 2

As the other answers point out, how a file was saved determines how it must be loaded. If the file really is in word2vec format and only fails to decode, you can pass unicode_errors='ignore' to skip the undecodable bytes and load the model anyway.

import gensim  

model = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True, unicode_errors='ignore')   

By default, this flag is set to 'strict': unicode_errors='strict'.

The documentation explains why errors like this occur:

unicode_errors : str, optional default 'strict', is a string suitable to be passed as the errors argument to the unicode() (Python 2.x) or str() (Python 3.x) function. If your source file may include word tokens truncated in the middle of a multibyte unicode character (as is common from the original word2vec.c tool), 'ignore' or 'replace' may help.
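
To make the three settings concrete, here is a minimal standard-library sketch of what they do to a token cut off in the middle of a multibyte character. decode_token is a hypothetical helper written for illustration, not gensim's code; it only passes the flag through as the errors argument, as the documentation describes.

```python
# "é" takes two bytes in UTF-8; dropping the last byte imitates a token
# truncated in the middle of a multibyte character.
full_token = "café".encode("utf8")   # b'caf\xc3\xa9'
truncated = full_token[:-1]          # b'caf\xc3'


def decode_token(raw, unicode_errors="strict"):
    """Hypothetical helper: decode raw bytes with the given error mode."""
    return raw.decode("utf8", errors=unicode_errors)


# 'strict' (the default) raises on the incomplete character.
try:
    decode_token(truncated)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the broken bytes.
ignored = decode_token(truncated, "ignore")

# 'replace' substitutes U+FFFD, the Unicode replacement character.
replaced = decode_token(truncated, "replace")
```

Note that 'ignore' can make two different truncated tokens decode to the same string, so 'replace' may be easier to spot afterwards.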

The answers above help when you know how each model was saved. If you have to load many models through one general routine, the flag above is a convenient way to do it.

I have trained several models with the original word2vec.c tool. When loading them into gensim, some loaded successfully and others raised Unicode errors. The flag above resolved those cases.
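
A different technique from the flag, for a general routine, is to look at the first line of each file and guess which loader it needs. This is only a heuristic sketch under two assumptions: files written with save() are pickles starting with byte 0x80, and word2vec-format files start with a header of two integers (vocabulary size and vector size). guess_loader and the file names are hypothetical, and the demo files are stand-ins rather than real models.

```python
import os
import pickle
import re
import tempfile

# Assumed word2vec header: "<vocab_size> <vector_size>" on the first line.
_HEADER = re.compile(rb"^\d+ \d+\s*$")


def guess_loader(path):
    """Heuristically name the loader a file needs, based on its first line."""
    with open(path, "rb") as fin:
        first_line = fin.readline(256)
    if first_line[:1] == b"\x80":
        return "load"                    # looks like a pickle from save()
    if _HEADER.match(first_line):
        return "load_word2vec_format"    # looks like a word2vec header
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file written by save(): a small pickle.
    native_path = os.path.join(tmp, "native.model")
    with open(native_path, "wb") as fout:
        pickle.dump({"stand-in": "for a saved model"}, fout, protocol=2)

    # Stand-in for a text word2vec file: header plus two 3-dimensional vectors.
    w2v_path = os.path.join(tmp, "vectors.txt")
    with open(w2v_path, "wb") as fout:
        fout.write(b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

    native_guess = guess_loader(native_path)
    w2v_guess = guess_loader(w2v_path)

print(native_guess, w2v_guess)
```

The header looks the same for the text and binary variants, so this sketch does not tell you which value of binary to pass; that still has to come from how the file was written.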

Solution 3

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')

Loading it with the load_word2vec_format method will then cause this error. To make it work, use:

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')

The same thing happens in reverse when you save the model with:

model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

and then try to load it with the KeyedVectors.load method. In this situation, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

Solution 4

If you saved your model with save(), you must load it with load().

load_word2vec_format() is for files in the word2vec format, such as those produced by Google's original tool, not for models saved with gensim's save().

Author by

user168983

Updated on December 03, 2020

Comments

  • user168983
    user168983 over 3 years

    I am trying to do the following Kaggle assignment. I am using the gensim package for word2vec. I can create the model and store it to disk, but when I try to load the file back I get the error below.

        -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
    Traceback (most recent call last):
      File "prog_w2v.py", line 7, in <module>
        models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
      File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
        header = utils.to_unicode(fin.readline())
      File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
        return unicode(text, encoding, errors=errors)
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
    

    I found a similar question, but I was unable to solve the problem. My prog_w2v.py is below.

    import gensim
    import time
    start = time.time()    
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
    end = time.time()   
    print end-start,"   seconds"
    

    I am generating the model using the code here. The program takes about half an hour to generate the model, so I cannot run it many times to debug it.