Huggingface saving tokenizer


Solution 1

save_vocabulary() saves only the vocabulary file of the tokenizer (the list of BPE tokens).

To save the entire tokenizer, you should use save_pretrained()

For example:

from transformers import AutoTokenizer, DistilBertTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

Edit:

For some unknown reason, instead of

tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

using

tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

works.

Solution 2

You need to save both your model and tokenizer in the same directory. HuggingFace actually looks for the config.json file of your model, so renaming tokenizer_config.json would not solve the issue.
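A minimal stdlib-only sketch of what this implies (the helper name is hypothetical): from_pretrained() treats a local path as loadable only when the directory contains a config.json, so you can check a directory before trying to load from it:

```python
from pathlib import Path

def is_loadable_model_dir(path):
    """Hypothetical pre-flight check: from_pretrained() expects a local
    directory to contain config.json (the model config, not
    tokenizer_config.json), so verify that before loading."""
    return (Path(path) / "config.json").is_file()
```

Saving the model with model.save_pretrained() into the same directory as the tokenizer is what writes this config.json.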

Solution 3

Renaming the "tokenizer_config.json" file -- the one created by save_pretrained() -- to "config.json" solved the same issue in my environment.
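A hedged sketch of that rename (the helper name is hypothetical). As the question's follow-up error shows, the resulting config.json may also need a model_type key (e.g. "distilbert") before AutoTokenizer recognizes it, so this copies the file and injects that key rather than renaming blindly:

```python
import json
from pathlib import Path

def expose_tokenizer_config(tokenizer_dir, model_type="distilbert"):
    """Hypothetical helper: copy tokenizer_config.json to config.json,
    injecting a model_type key to avoid the 'Unrecognized model' error."""
    src = Path(tokenizer_dir) / "tokenizer_config.json"
    dst = Path(tokenizer_dir) / "config.json"
    config = json.loads(src.read_text())
    config.setdefault("model_type", model_type)  # only added if missing
    dst.write_text(json.dumps(config))
    return dst
```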

Author: sachinruk

PhD in Bayesian Machine Learning. Obsessed with DL. Currently dipping toes in Reinforcement Learning.

Updated on June 15, 2022

Comments

  • sachinruk
    sachinruk about 2 years

    I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.

    from transformers import AutoTokenizer

    BASE_MODEL = "distilbert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.save_vocabulary("./models/tokenizer/")
    tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
    

    However, the last line is giving the error:

    OSError: Can't load config for './models/tokenizer3/'. Make sure that:
    
    - './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'
    
    - or './models/tokenizer3/' is the correct path to a directory containing a config.json file
    

    transformers version: 3.1.0

    The question "How to load the saved tokenizer from pretrained model in Pytorch" didn't help, unfortunately.

    Edit 1

    Thanks to @ashwin's answer below, I tried save_pretrained instead, and I got the following error:

    OSError: Can't load config for './models/tokenizer/'. Make sure that:
    
    - './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
    
    - or './models/tokenizer/' is the correct path to a directory containing a config.json file
    

    The contents of the tokenizer folder are below: [screenshot of the tokenizer directory]

    I tried renaming tokenizer_config.json to config.json and then I got the error:

    ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
    
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    Tried looking into it. It seems like a bug. And as you have figured out, it saves tokenizer_config.json but expects config.json.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    As a workaround, since you are not modifying the tokenizer: get the model using from_pretrained(), then save the model; you can then load the tokenizer from the saved model's directory. This should serve as a tentative workaround.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    Please check out the modification.
  • cronoik
    cronoik over 3 years
    @sachinruk: Just in case you have to work with the AutoTokenizers, you have to save the corresponding config as shown here.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    @cronoik, I checked your answer in the other post. However, I was curious to know if there is any raised issue on github? I could not find any issue concerning this problem.
  • cronoik
    cronoik over 3 years
    @AshwinGeetD'Sa Yes, there is. I have linked to it in the first sentence :) But the issue was closed. I will reopen it tomorrow and provide a patch.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    That's awesome :)
  • user5520049
    user5520049 almost 3 years
    Excuse me, will evaluating the captioning for the images be applied to testing, for example if I'm using COCO?
  • MAC
    MAC over 2 years
    tokenizer.save_pretrained("/home/pchhapolika/Bert_multilingual_exp_TCM/model_mlm_exp1") produces 4 files when I add new tokens. Ideally, when I save the tokenizer, shouldn't it produce only one tokenizer.json file?
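The workaround discussed in the comments above (save the model and the tokenizer into the same directory, then load offline) can be sketched as follows. The model name and output directory are taken from the question; transformers is imported lazily inside the function, and this is an untested sketch, not a verified fix:

```python
def save_model_and_tokenizer(base_model="distilbert-base-multilingual-cased",
                             out_dir="./models/tokenizer/"):
    """Sketch of the workaround: saving the model alongside the tokenizer
    writes the config.json that from_pretrained() looks for."""
    # Lazy import so the sketch can be defined without transformers installed.
    from transformers import AutoModel, AutoTokenizer
    model = AutoModel.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model.save_pretrained(out_dir)      # writes config.json and the weights
    tokenizer.save_pretrained(out_dir)  # writes the tokenizer files alongside
    # Later, in the offline container:
    # tokenizer2 = AutoTokenizer.from_pretrained(out_dir)
```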