PyTorch / Gensim - How to load pre-trained word embeddings

Solution 1

I just wanted to report my findings about loading a gensim embedding with PyTorch.


  • Solution for PyTorch 0.4.0 and newer:

From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.

import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)

The weights from gensim can easily be obtained by:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)  # model.vectors replaces the now-deprecated syn0

As noted by @Guglie: in newer gensim versions (and when you work with a full Word2Vec model rather than plain KeyedVectors) the weights can be obtained from model.wv:

weights = torch.FloatTensor(model.wv.vectors)
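
As several of the comments below point out, the token ids you feed into the layer must match the row order of the gensim matrix, otherwise words map to the wrong vectors. A minimal sketch of keeping them aligned (gensim 4.x exposes the mapping as key_to_index; in gensim 3.x it is model.vocab[word].index; the example words are placeholders and must exist in the model's vocabulary):

import torch
import torch.nn as nn

embedding = nn.Embedding.from_pretrained(weights)  # weights from the snippet above

# look up the row indices gensim assigned to the words, then fetch their vectors
ids = torch.LongTensor([model.key_to_index[w] for w in ['hello', 'world']])
vectors = embedding(ids)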

  • Solution for PyTorch version 0.3.1 and older:

I'm using version 0.3.1 and from_pretrained() isn't available in this version.

Therefore I created my own from_pretrained so I can also use it with 0.3.1.

Code for from_pretrained for PyTorch versions 0.3.1 or lower:

def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
         'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

The embedding can then be loaded just like this:

embedding = from_pretrained(weights)

I hope this is helpful for someone.

Solution 2

I think it is easy. Just copy the embedding weights from gensim into the corresponding weight of the PyTorch embedding layer.

You need to make sure two things are correct: first, the weight matrix has to have the right shape; second, the weights have to be converted to the PyTorch FloatTensor type.
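
A minimal sketch of that manual copy (assuming gensim_model is an already trained gensim Word2Vec model; the name is a placeholder):

import torch
import torch.nn as nn

# gensim_model.wv.vectors is a (vocab_size, embedding_dim) numpy array
weights = torch.FloatTensor(gensim_model.wv.vectors)   # correct type
vocab_size, embedding_dim = weights.shape              # correct shape

embedding = nn.Embedding(vocab_size, embedding_dim)
with torch.no_grad():
    embedding.weight.copy_(weights)   # copy the gensim weights into the layer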

Solution 3

from gensim.models import Word2Vec

model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
# gensim model created (note: in gensim 4.x the size parameter is called vector_size)

import torch
import torch.nn as nn

weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
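
To look up a word with this layer you still need the row index gensim assigned to it. A small usage sketch continuing the snippet above ('good' is just a placeholder word):

idx = model.wv.vocab['good'].index            # gensim 3.x; in 4.x use model.wv.key_to_index['good']
vector = embedding(torch.LongTensor([idx]))   # its pre-trained vector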

Solution 4

I had the same question, except that I use the torchtext library with PyTorch because it helps with padding, batching, and other things. This is what I did to load pre-trained embeddings with torchtext 0.3.0 and to pass them to PyTorch 0.4.1 (the PyTorch part uses the method mentioned by blue-phoenox):

import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab

# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)

# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])

# build vocabulary
text_field.build_vocab(dataset)

# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)

# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)

# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))

# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)
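
For completeness, a minimal sketch of how such a frozen embedding layer typically sits inside a model (the class, the mean-pooling, and num_classes are illustrative assumptions, not part of the original answer):

class TextClassifier(nn.Module):
    def __init__(self, vectors, num_classes):
        super().__init__()
        # freeze=True keeps the pre-trained vectors fixed during training
        self.embedding = nn.Embedding.from_pretrained(vectors, freeze=True)
        self.fc = nn.Linear(vectors.shape[1], num_classes)

    def forward(self, text):                  # text: LongTensor of token ids, (seq_len, batch)
        embedded = self.embedding(text)       # (seq_len, batch, emb_dim) with torchtext defaults
        return self.fc(embedded.mean(dim=0))  # average over the sequence, then classify

model = TextClassifier(torch.FloatTensor(text_field.vocab.vectors), num_classes=2)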

Solution 5

Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"

I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.

import os
from os.path import basename

import torch
from gensim.models import KeyedVectors
import torchtext.vocab as vocab

def convert_bin_emb_txt(out_path, emb_file):
    # convert a binary word2vec file to the txt format torchtext can read
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file

emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)

Tested with PyTorch 1.2.0 and torchtext 0.4.0.

I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not in the embeddings from the normal distribution, and set the <unk> and <pad> vectors to zero.
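
A minimal sketch of that last step, following the pattern of the linked tutorial (zeroing the <unk> and <pad> vectors this way is my reading of the usual torchtext setup, not code from the original answer):

import torch
import torch.nn as nn

embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)

# zero out the vectors of the special <unk> and <pad> tokens
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
EMBEDDING_DIM = TEXT.vocab.vectors.shape[1]

embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)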

Comments

  • MBT
    MBT about 4 years

    I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

    So my question is, how do I get the embedding weights loaded by gensim into the PyTorch embedding layer.

    Thanks in Advance!

  • MBT
    MBT about 6 years
    I didn't know there is a _weight parameter in the constructor, I will try it out - thank you!
  • Geoffrey Negiar
    Geoffrey Negiar over 5 years
    What is the input to your model after that? Is it the text itself or the 1-hot encoding of the text?
  • MBT
    MBT over 5 years
    PyTorch does not use one-hot encoding; you can just use integer ids / token ids to access the respective embeddings: torch.LongTensor([1]), or for a sequence torch.LongTensor(any_sequence), e.g. torch.LongTensor([1, 2, 5, 9, 12, 92, 7]). As output you will get the respective embeddings.
  • MBT
    MBT over 5 years
    Thanks for your reply. I've taken a look at gensim to check your approach. Looking at the gensim page radimrehurek.com/gensim/models/word2vec.html#usage-examples, it says the Word2Vec model is only used for training the word vectors, as this format is much slower than KeyedVectors. After you're done with training you normally save them into a KeyedVectors model. This model is dedicated to saving pre-trained vectors, "resulting in a much smaller and faster object" than the Word2Vec model. You can do it that way, but I see no benefit in using it this way.
  • Jibin Mathew
    Jibin Mathew over 5 years
    Thanks @blue-phoenox, I had read that. I wrote this code under the assumption that the embeddings are created and used right away rather than loaded from a file.
  • MBT
    MBT over 5 years
    Of course you can do that. But this would mean that every time you start the training process you would also train the embeddings. This is just wasted computation and not really the idea of pre-trained embeddings. When I create models, I normally run them multiple times, and I do not want to train my pre-trained embeddings again every time I start the training process of my model.
  • Jibin Mathew
    Jibin Mathew over 5 years
    The main emphasis is on the torch section, hence I leave the reader to deal with the gensim model and its loading. There could be situations wherein the dev uses the gensim model right after creation.
  • MBT
    MBT over 5 years
    I was just pointing out that in this use-case the vectors are not really pre-trained: your code example doesn't load pre-trained vectors, it trains new word vectors. I was just wondering if there was another use-case, which is why I was asking.
  • Jinglesting
    Jinglesting almost 5 years
    @blue-phoenox how do you get the integer/token ids please?
  • Clement Attlee
    Clement Attlee over 4 years
    @Jinglesting This is not a general answer and could cause performance to drop, since the pre-trained embedding potentially uses a different indexing than the one you have used in your application.
  • Guglie
    Guglie over 4 years
    with newer versions of gensim vectors are in model.wv.vectors
  • MBT
    MBT over 4 years
    @Guglie Thank you for mentioning this! I've added it.
  • information_interchange
    information_interchange about 4 years
    Actually, I think this answer is incorrect. It requires that we have the same token-to-index mapping, i.e. if index 401 corresponds to "natural" in the gensim vectors, then in our own model we must make sure that index 401 also corresponds to "natural".
  • information_interchange
    information_interchange about 4 years
    Using the full model is the more principled approach: radimrehurek.com/gensim/models/keyedvectors.html
  • z.ghane
    z.ghane over 2 years
    You mean self.classifier(u)?