How to use BERT for long text classification?


Solution 1

You have basically three options:

  1. You cut the longer texts off and only use the first 512 tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
  2. You split your text into multiple subtexts, classify each of them, and combine the results back together (for example, choose the class that was predicted for most of the subtexts). This option is obviously more expensive.
  3. You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune), as described in this discussion.

I would suggest trying option 1, and only considering the other options if it is not good enough.
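Below is a minimal sketch of options 1 and 2 using the HuggingFace Transformers library. Note that bert-base-uncased is just a stand-in; in practice you would load a classifier fine-tuned on your task:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Stand-in checkpoint; in practice, load your own fine-tuned classifier.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def classify_truncated(text):
    # Option 1: keep only the first 512 tokens.
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item()

def classify_chunked(text, chunk_size=510):
    # Option 2: split into ~510-token chunks (leaving room for [CLS] and
    # [SEP]), classify each chunk, then take a majority vote.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    votes = []
    for start in range(0, len(ids), chunk_size):
        # Decoding and re-encoding is a simple (slightly lossy) way to
        # rebuild each chunk as text.
        chunk_text = tokenizer.decode(ids[start:start + chunk_size])
        votes.append(classify_truncated(chunk_text))
    return max(set(votes), key=votes.count)
```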

Solution 2

This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification? (https://arxiv.org/abs/1905.05583). On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches that break the article into chunks and then recombine the results.
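A sketch of that head-plus-tail selection, assuming you work on token ids produced by a subword tokenizer (the 128/382 split is the one the paper reports):

```python
def head_tail(token_ids, head=128, tail=382):
    # Keep the first `head` and last `tail` tokens, dropping the middle.
    # 128 + 382 = 510, which leaves room for [CLS] and [SEP].
    if len(token_ids) <= head + tail:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]

# Usage with a HuggingFace tokenizer (assumed), before special tokens:
# ids = tokenizer(text, add_special_tokens=False)["input_ids"]
# selected = head_tail(ids)
```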

As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)

Solution 3

In addition to chunking data and passing it to BERT, check out the following new approaches.

There is new research on long-document analysis. Since you asked about BERT: a similar pre-trained transformer, Longformer, has recently been made available by the Allen Institute for AI; check out this link for the paper (https://arxiv.org/abs/2004.05150).

The related work section of that paper also mentions some previous work on long sequences; Google them too. I'd suggest at least going through Transformer-XL (https://arxiv.org/abs/1901.02860). As far as I know, it was one of the first models for long sequences, so it is a good foundation before moving on to Longformer.

Solution 4

You can leverage the HuggingFace Transformers library, which includes the following Transformers that work with long texts (more than 512 tokens):

  • Reformer: combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences.
  • Longformer: uses an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.

Other recently proposed efficient Transformer models include Sparse Transformers (Child et al., 2019), Linformer (Wang et al., 2020), Sinkhorn Transformers (Tay et al., 2020b), Performers (Choromanski et al., 2020b), Synthesizers (Tay et al., 2020a), Linear Transformers (Katharopoulos et al., 2020), and BigBird (Zaheer et al., 2020).

The Long-Range Arena paper, by authors from Google Research and DeepMind (https://arxiv.org/abs/2011.04006), compares these Transformers based on the benchmark's aggregated metrics:

(Figure: performance, speed, and memory footprint of the Transformers.)

They also suggest that Longformer performs better than Reformer on the classification task.
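As a rough usage sketch with HuggingFace Transformers: allenai/longformer-base-4096 is a real checkpoint, but the label count and the input text below are placeholders, and the classification head would still need fine-tuning on your task:

```python
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2)  # num_labels is a placeholder

long_text = "a very long document " * 1000  # placeholder input
inputs = tokenizer(long_text, truncation=True, max_length=4096,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())
```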

Solution 5

I have recently (April 2021) published a paper on this topic that you can find on arXiv (https://arxiv.org/abs/2104.07225).

There, Table 1 reviews previous approaches to the problem in question, and the whole manuscript is about long text classification, proposing a new method called Text Guide. This new method claims to improve performance over the naive and semi-naive text selection methods used in the paper mentioned in one of the previous answers to this question (https://arxiv.org/abs/1905.05583).

Long story short about your options:

  1. Low computational cost: use naive/semi-naive approaches to select a part of the original text instance. Examples include choosing the first n tokens, or compiling a new text instance out of the beginning and end of the original one.

  2. Medium to high computational cost: use recent transformer models (like Longformer) that have a 4096-token limit instead of 512. In some cases this allows covering the whole text instance, and the modified attention mechanism decreases the computational cost.

  3. High computational cost: divide the text instance into chunks that fit a model like BERT with the 'standard' limit of 512 tokens per instance, deploy the model on each part separately, and join the resulting vector representations (a sketch follows below).
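A minimal sketch of option 3, assuming a plain BERT encoder and mean pooling over the per-chunk [CLS] vectors (the pooling choice is an assumption; concatenation or a small recurrent network over the chunk vectors would also work):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def document_embedding(text, chunk_size=510):
    # Split into chunks that fit the 512-token limit, encode each chunk,
    # and mean-pool the [CLS] vectors into one document representation
    # that a downstream classifier can consume.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_vectors = []
    for start in range(0, len(ids), chunk_size):
        chunk_text = tokenizer.decode(ids[start:start + chunk_size])
        inputs = tokenizer(chunk_text, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            out = encoder(**inputs)
        cls_vectors.append(out.last_hidden_state[:, 0])  # [CLS] vector
    return torch.cat(cls_vectors).mean(dim=0)  # shape: (hidden_size,)
```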

Now, about the new method proposed in my recently published paper, called Text Guide. Text Guide is a text selection method that improves performance compared to naive or semi-naive truncation methods. As a text selection method, Text Guide doesn't interfere with the language model, so it can be used to improve the performance of models with a 'standard' token limit (512 for transformer models) or an 'extended' one (4096, as for the Longformer model).

In summary: Text Guide is a low-computational-cost method that improves performance over naive and semi-naive truncation methods. If text instances exceed the limit of models deliberately developed for long text classification, like Longformer (4096 tokens), it can also improve their performance.

Comments

  • user1337896
    user1337896 almost 2 years

    We know that BERT has a maximum length limit of 512 tokens, so if an article is much longer than that, say 10,000 tokens, how can BERT be used?

  • user1337896
    user1337896 over 4 years
    I am planning to use BERT as a paragraph encoder and then feed the outputs to an LSTM; is that workable?
  • Anoyz
    Anoyz about 4 years
    In my experience I also had to analyse large paragraphs, and what worked best was indeed to consider only the last 512 tokens, as they were the most informative (they usually concluded the subject). But I believe this is strongly dependent on the domain and the text at hand. Also, the second option presented here didn't work as well for me, because I was handling conversational text and individual sentences said little about the classification.
  • Alexandre Pieroux
    Alexandre Pieroux over 3 years
    I'll add that Longformer (I don't know about the others) still has a limit of 4096 tokens.
  • Ashish
    Ashish over 2 years
    Could you explain what they did? I went through the paper, and it wasn't clear what they did. Did they make some change to BERT itself?
  • Cristian Arteaga
    Cristian Arteaga over 2 years
    You could achieve this by stacking together the pre-trained positional encodings. Check the source code at this link: discuss.huggingface.co/t/…
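For illustration only, a rough sketch of what stacking the pre-trained positional encodings could look like for BERT; the linked thread may do this differently, and a model extended this way would still need fine-tuning on long inputs:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
old = model.embeddings.position_embeddings.weight.data  # shape (512, hidden)
factor = 4                                              # 512 -> 2048 positions
extended = torch.nn.Embedding(old.size(0) * factor, old.size(1))
extended.weight.data = old.repeat(factor, 1)            # stack pre-trained rows
model.embeddings.position_embeddings = extended
model.embeddings.position_ids = torch.arange(old.size(0) * factor).expand(1, -1)
model.config.max_position_embeddings = old.size(0) * factor
# Further plumbing may be needed depending on the transformers version
# (e.g. the token_type_ids buffer), plus fine-tuning on long inputs.
```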