ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error
Solution 1
I had the same error. The problem was that my list contained a None value, e.g.:
import pandas as pd
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')
# create test dataframe; the last text is None, which triggers the error
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
         'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
         'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
         None]
labels = [1, 2, 3, 1]
d = {'texts': texts, 'labels': labels}
test_df = pd.DataFrame(d)
So, before converting the DataFrame column to a list, I removed all rows containing None.
test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)
This worked for me.
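If the texts and labels are already plain lists rather than a DataFrame, the same idea can be sketched in pure Python. The helper name split_valid_texts is hypothetical (it is not part of transformers or pandas). It assumes that any entry that is not a str (None, or the float NaN that pandas produces for empty CSV cells) should be dropped together with its label.

```python
def split_valid_texts(texts, labels):
    """Keep only (text, label) pairs whose text is a str.

    Hypothetical helper for illustration. Returns the cleaned texts,
    the matching labels, and the indices of the entries that were dropped.
    """
    clean_texts, clean_labels, bad_indices = [], [], []
    for i, (text, label) in enumerate(zip(texts, labels)):
        if isinstance(text, str):
            clean_texts.append(text)
            clean_labels.append(label)
        else:
            # None and float('nan') both end up here
            bad_indices.append(i)
    return clean_texts, clean_labels, bad_indices


texts = ["first example", None, "third example", float("nan")]
labels = [1, 2, 3, 1]
clean_texts, clean_labels, bad_indices = split_valid_texts(texts, labels)
print(bad_indices)  # [1, 3]
```

Printing the dropped indices makes it easier to see which rows of the original data were empty before passing clean_texts to the tokenizer.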
Solution 2
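In my case, my input was already split into words, so I had to set is_split_into_words=True. The tokenizer documentation describes the input as follows: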
https://huggingface.co/transformers/main_classes/tokenizer.html
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
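As a rough sketch of how that flag could be chosen, the snippet below checks whether a batch is a list of word lists and passes is_split_into_words accordingly. The helper names is_pretokenized_batch and encode_batch are hypothetical, and the tokenizer argument is assumed to be any callable that accepts the same keyword arguments as the Hugging Face tokenizers used in this thread. Note that it only detects a batch of word lists; a single flat list of strings stays ambiguous, as the documentation says.

```python
def is_pretokenized_batch(batch):
    """True if batch is a non-empty list of lists of str (hypothetical helper)."""
    return bool(batch) and all(
        isinstance(seq, list) and all(isinstance(word, str) for word in seq)
        for seq in batch
    )


def encode_batch(tokenizer, batch):
    """Call the tokenizer, setting is_split_into_words only for word lists."""
    return tokenizer(
        batch,
        is_split_into_words=is_pretokenized_batch(batch),
        truncation=True,
        padding=True,
    )


# Example inputs: a batch of raw sentences and a batch of word lists
sentences = ["Hello world", "How are you"]
word_lists = [["Hello", "world"], ["How", "are", "you"]]
print(is_pretokenized_batch(sentences))   # False
print(is_pretokenized_batch(word_lists))  # True
```

With a real tokenizer this would be called as encode_batch(tokenizer, word_lists), where tokenizer comes from DistilBertTokenizerFast.from_pretrained(...).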
Authored by
Raoof Naushad
Updated on June 24, 2022
Comments
-
Raoof Naushad about 2 years
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
When I tried to tokenize the texts I had split from the dataframe using the BERT tokenizer, I got the error in the title.
-
Evan Zamir over 3 years
train_texts just needs to be a list of strings?
-
Timbus Calin over 2 years
Can confirm this also solved the problem in my case.