How to compare sentence similarities using embeddings from BERT


Solution 1

You can use the [CLS] token as a representation for the entire sequence. This token is prepended to your sentence during the preprocessing step and is typically used for classification tasks (see Figure 2 and Section 3.2 in the BERT paper).

It is the very first token of the sequence, so its embedding is the first vector in the model output.

Alternatively, you can take the average vector of the sequence (i.e. average over the token axis, axis 1 in your example), which can yield better results according to the Hugging Face documentation (3rd tip).

Note that BERT was not designed for sentence similarity using the cosine distance, though in my experience it does yield decent results.
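
To make this concrete, here is a minimal sketch of both options using the same bert-base-multilingual-cased checkpoint as in the question (the helper function and the example sentences are just illustrative, not part of the original question):

    import torch
    from transformers import BertModel, BertTokenizer

    # Same multilingual checkpoint as in the question.
    tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    model = BertModel.from_pretrained('bert-base-multilingual-cased')
    model.eval()

    def embed(sentence, pooling='cls'):
        """Return a single 768-dimensional vector for a sentence."""
        input_ids = torch.tensor([tokenizer.encode(sentence, add_special_tokens=True)])
        with torch.no_grad():
            last_hidden_state = model(input_ids)[0]   # [1, seq_len, 768]
        if pooling == 'cls':
            return last_hidden_state[:, 0, :]         # [CLS] token vector, [1, 768]
        return last_hidden_state.mean(dim=1)          # average over tokens, [1, 768]

    a = embed('The cat sits on the mat.')
    b = embed('A cat is sitting on a rug.')
    print(torch.nn.functional.cosine_similarity(a, b).item())

Either pooling choice yields a fixed 768-dimensional vector, which also sidesteps the variable-length issue raised in the question.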

Solution 2

In addition to the already great accepted answer, I want to point you to sentence-BERT, which discusses the similarity aspect and the implications of specific metrics (like cosine similarity) in greater detail. They also have a very convenient implementation online (the sentence-transformers package). The main advantage here is that they gain a lot of processing speed compared to a "naive" sentence embedding comparison with vanilla BERT, though I am not familiar enough with the implementation itself to go into detail.
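
For reference, a rough sketch of how that package is typically used; the multilingual model name below is just one example from their model list, and util.cos_sim assumes a reasonably recent version of sentence-transformers:

    from sentence_transformers import SentenceTransformer, util

    # Example multilingual model from the sentence-transformers model zoo
    # (pick whichever pretrained model fits your language pair).
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    sentences = ['The cat sits on the mat.', 'A cat is sitting on a rug.']
    embeddings = model.encode(sentences, convert_to_tensor=True)  # [2, embedding_dim]

    # Cosine similarity between the two fixed-length sentence embeddings.
    print(util.cos_sim(embeddings[0], embeddings[1]).item())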

Importantly, there is generally also a more fine-grained distinction in what kind of similarity you want to look at. Specifically for that, there is a great discussion in one of the task papers from SemEval 2014 (the SICK dataset), which goes into more detail. From your task description, I am assuming that you are already using data from one of the later SemEval tasks, which also extended this to multilingual similarity.


Comments

  • KOB
    KOB about 2 years

    I am using the HuggingFace Transformers package to access pretrained models. As my use case needs functionality for both English and Arabic, I am using the bert-base-multilingual-cased pretrained model. I need to be able to compare the similarity of sentences using something such as cosine similarity. To use this, I first need to get an embedding vector for each sentence, and can then compute the cosine similarity.

Firstly, what is the best way to extract the semantic embedding from the BERT model? Would taking the last hidden state of the model after being fed the sentence suffice?

    import torch
    from transformers import BertModel, BertTokenizer
    
    model_class = BertModel
    tokenizer_class = BertTokenizer
    pretrained_weights = 'bert-base-multilingual-cased'
    
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)
    
    sentence = 'this is a test sentence'
    
    # add_special_tokens=True prepends [CLS] and appends [SEP] to the sentence
    input_ids = torch.tensor([tokenizer.encode(sentence, add_special_tokens=True)])
    with torch.no_grad():
        output_tuple = model(input_ids)
        last_hidden_states = output_tuple[0]  # shape: [1, sequence_length, 768]
    
    print(last_hidden_states.size(), last_hidden_states)
    

    Secondly, if this is a sufficient way to get embeddings for my sentences, I now have another problem: the embeddings have different sizes depending on the length of the original sentence. The output shape is [1, n, 768], where 768 is the hidden size and n varies with the number of tokens.

    In order to compute two vectors' cosine similarity, they need to be the same length. How can I do this here? Could something as naive as first summing across axis=1 still work? What other options do I have?

  • KOB
    KOB over 4 years
    Ok I see - very interesting. So to try both of the methods you outlined: say that after feeding a sentence through the model as I have shown above, the last_hidden_states I extracted has a shape of [1, 9, 768]. Then I could (1) use the [CLS] token as last_hidden_states[0][0], giving me a vector of length 768, or (2) get the average across the middle axis using last_hidden_states.mean(1), also giving a vector of length 768? (See the short shape-check sketch after this comment thread.)
  • Swier
    Swier over 4 years
    Yes, either of those should give a meaningful vector representing the input sentence.
  • Thang Pham
    Thang Pham about 4 years
    Hi @Swier, do you think that if we compute an average vector, the sentence representation will lose information or its dependencies (e.g. word order)? Is there any other way to compare sentence embeddings if this kind of similarity does not work?
  • Swier
    Swier about 4 years
    @ThangM.Pham The sentence embedding will never contain all the information in the original sentence; they contain the information that is most useful for the training task. If an embedding doesn't prove useful for your problem, you'll either have to continue training it for a few iterations, or find an embedding that is suited for your task. I suggest you take a look at the BERT implementations suggested by dennlinger in the other answer to this question.
  • user8291021
    user8291021 about 4 years
    Thank you! Yes - I was able to get a very good solution with sentence-BERT through a triplet loss model.
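
To make the shape bookkeeping from the comment thread above concrete, here is a tiny sketch with random dummy tensors standing in for the model outputs (illustrative values only, not a real model run):

    import torch

    # Stand-ins for two model outputs like the one in the question:
    # batches of 1 sentence with 9 resp. 12 tokens, hidden size 768.
    hidden_a = torch.randn(1, 9, 768)
    hidden_b = torch.randn(1, 12, 768)

    cls_a = hidden_a[0][0]       # option (1): [CLS] vector, shape [768]
    mean_a = hidden_a.mean(1)    # option (2): token average, shape [1, 768]
    print(cls_a.shape, mean_a.shape)

    # After pooling, both sentences have fixed-length vectors and can be
    # compared even though their token counts differ.
    sim = torch.nn.functional.cosine_similarity(hidden_a.mean(1), hidden_b.mean(1))
    print(sim.item())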