PyTorch / Gensim - How to load pre-trained word embeddings

Solution 1

I just wanted to report my findings about loading a gensim embedding with PyTorch.


  • Solution for PyTorch 0.4.0 and newer:

From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.

import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)

The weights from gensim can easily be obtained by:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)  # model.vectors replaces the now-deprecated syn0

As noted by @Guglie: in newer gensim versions (and when you work with a full Word2Vec model rather than plain KeyedVectors) the weights can be obtained from model.wv:

weights = torch.FloatTensor(model.wv.vectors)
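
As several of the comments below point out, the token ids you feed into the layer must match the row order of the gensim matrix, otherwise words map to the wrong vectors. A minimal sketch of keeping them aligned (gensim 4.x exposes the mapping as key_to_index; in gensim 3.x it is model.vocab[word].index; the example words are placeholders and must exist in the model's vocabulary):

import torch
import torch.nn as nn

embedding = nn.Embedding.from_pretrained(weights)  # weights from the snippet above

# look up the row indices gensim assigned to the words, then fetch their vectors
ids = torch.LongTensor([model.key_to_index[w] for w in ['hello', 'world']])
vectors = embedding(ids)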

  • Solution for PyTorch version 0.3.1 and older:

I'm using version 0.3.1 and from_pretrained() isn't available in this version.

Therefore I created my own from_pretrained so I can also use it with 0.3.1.

Code for from_pretrained for PyTorch versions 0.3.1 or lower:

def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
         'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

The embedding can then be loaded just like this:

embedding = from_pretrained(weights)

I hope this is helpful for someone.

Solution 2

I think it is easy. Just copy the embedding weights from gensim into the corresponding weight of the PyTorch embedding layer.

You need to make sure two things are correct: first, the weight matrix has to have the right shape; second, the weights have to be converted to the PyTorch FloatTensor type.
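
A minimal sketch of that manual copy (assuming gensim_model is an already trained gensim Word2Vec model; the name is a placeholder):

import torch
import torch.nn as nn

# gensim_model.wv.vectors is a (vocab_size, embedding_dim) numpy array
weights = torch.FloatTensor(gensim_model.wv.vectors)   # correct type
vocab_size, embedding_dim = weights.shape              # correct shape

embedding = nn.Embedding(vocab_size, embedding_dim)
with torch.no_grad():
    embedding.weight.copy_(weights)   # copy the gensim weights into the layer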

Solution 3

from gensim.models import Word2Vec

model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
# gensim model created (note: in gensim 4.x the size parameter is called vector_size)

import torch
import torch.nn as nn

weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
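
To look up a word with this layer you still need the row index gensim assigned to it. A small usage sketch continuing the snippet above ('good' is just a placeholder word):

idx = model.wv.vocab['good'].index            # gensim 3.x; in 4.x use model.wv.key_to_index['good']
vector = embedding(torch.LongTensor([idx]))   # its pre-trained vector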

Solution 4

I had the same question, except that I use the torchtext library with PyTorch because it helps with padding, batching, and other things. This is what I did to load pre-trained embeddings with torchtext 0.3.0 and to pass them to PyTorch 0.4.1 (the PyTorch part uses the method mentioned by blue-phoenox):

import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab

# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)

# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])

# build vocabulary
text_field.build_vocab(dataset)

# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)

# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)

# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))

# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)
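
For completeness, a minimal sketch of how such a frozen embedding layer typically sits inside a model (the class, the mean-pooling, and num_classes are illustrative assumptions, not part of the original answer):

class TextClassifier(nn.Module):
    def __init__(self, vectors, num_classes):
        super().__init__()
        # freeze=True keeps the pre-trained vectors fixed during training
        self.embedding = nn.Embedding.from_pretrained(vectors, freeze=True)
        self.fc = nn.Linear(vectors.shape[1], num_classes)

    def forward(self, text):                  # text: LongTensor of token ids, (seq_len, batch)
        embedded = self.embedding(text)       # (seq_len, batch, emb_dim) with torchtext defaults
        return self.fc(embedded.mean(dim=0))  # average over the sequence, then classify

model = TextClassifier(torch.FloatTensor(text_field.vocab.vectors), num_classes=2)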

Solution 5

Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"

I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.

import os
from os.path import basename

import torch
from gensim.models import KeyedVectors
import torchtext.vocab as vocab

def convert_bin_emb_txt(out_path, emb_file):
    # convert a binary word2vec file to the txt format torchtext can read
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file

emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)

Tested with PyTorch 1.2.0 and torchtext 0.4.0.

I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not in the embeddings from the normal distribution, and set the <unk> and <pad> vectors to zero.
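
A minimal sketch of that last step, following the pattern of the linked tutorial (zeroing the <unk> and <pad> vectors this way is my reading of the usual torchtext setup, not code from the original answer):

import torch
import torch.nn as nn

embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)

# zero out the vectors of the special <unk> and <pad> tokens
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
EMBEDDING_DIM = TEXT.vocab.vectors.shape[1]

embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)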

Comments

  • MBT
    MBT about 4 years

    I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

    So my question is, how do I get the embedding weights loaded by gensim into the PyTorch embedding layer.

    Thanks in Advance!

  • MBT
    MBT about 6 years
    I didn't know there is a _weight parameter in the constructor, I will try it out - thank you!
  • Geoffrey Negiar
    Geoffrey Negiar over 5 years
    What is the input to your model after that? Is it the text itself or the 1-hot encoding of the text?
  • MBT
    MBT over 5 years
    PyTorch does not use one-hot encoding; you can just use integer ids / token ids to access the respective embeddings: torch.LongTensor([1]), or for a sequence torch.LongTensor(any_sequence), e.g. torch.LongTensor([1, 2, 5, 9, 12, 92, 7]). As output you will get the respective embeddings.
  • MBT
    MBT over 5 years
    Thanks for your reply. I've taken a look at gensim to check your approach. Looking at the gensim page radimrehurek.com/gensim/models/word2vec.html#usage-examples, it says the Word2Vec model is only used for training the word vectors, as this format is much slower than KeyedVectors. After you're done with training you normally save them into a KeyedVectors model. This model is dedicated to saving pre-trained vectors, "resulting in a much smaller and faster object" than the Word2Vec model. You can do it that way, but I see no benefit in using it this way.
  • Jibin Mathew
    Jibin Mathew over 5 years
    Thanks @blue-phoenox, I had read that. I wrote this code under the assumption that the embeddings are created and used right away rather than loaded from a file.
  • MBT
    MBT over 5 years
    Of course you can do that. But this would mean that every time you start the training process you would also train the embeddings. This is just wasted computation and not really the idea of pre-trained embeddings. When I create models, I normally run them multiple times, and I do not want to train my pre-trained embeddings again every time I start the training process of my model.
  • Jibin Mathew
    Jibin Mathew over 5 years
    The main emphasis is on the torch section, hence I leave the reader to deal with the gensim model and its loading. There could be situations wherein the dev uses the gensim model right after creation.
  • MBT
    MBT over 5 years
    I was just pointing out that in this use-case the vectors are not really pre-trained: your code example doesn't load pre-trained vectors, it trains new word vectors. I was just wondering if there was another use-case, which is why I was asking.
  • Jinglesting
    Jinglesting almost 5 years
    @blue-phoenox how do you get the integer/token ids please?
  • Clement Attlee
    Clement Attlee over 4 years
    @Jinglesting This is not a general answer and could cause performance to drop, since the pre-trained embedding potentially uses a different indexing than the one you have used in your application.
  • Guglie
    Guglie over 4 years
    with newer versions of gensim vectors are in model.wv.vectors
  • MBT
    MBT over 4 years
    @Guglie Thank you for mentioning this! I've added it.
  • information_interchange
    information_interchange about 4 years
    Actually, I think this answer is incorrect. It requires that we have the same token-to-index mapping, i.e. if index 401 corresponds to "natural" in the gensim vectors, then in our own model we must make sure that index 401 also corresponds to "natural".
  • information_interchange
    information_interchange about 4 years
    Using the full model is the more principled approach: radimrehurek.com/gensim/models/keyedvectors.html
  • z.ghane
    z.ghane over 2 years
    You mean self.classifier(u)?