Huggingface saving tokenizer


Solution 1

save_vocabulary() saves only the vocabulary file of the tokenizer (the list of BPE tokens).

To save the entire tokenizer, you should use save_pretrained()

For example:

from transformers import AutoTokenizer, DistilBertTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

Edit:

For some unknown reason, instead of

tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

using

tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

works.

Solution 2

You need to save both your model and tokenizer in the same directory. HuggingFace actually looks for the config.json file of your model, so renaming tokenizer_config.json would not solve the issue.
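A minimal stdlib-only sketch of what this implies (the helper name is hypothetical): from_pretrained() treats a local path as loadable only when the directory contains a config.json, so you can check a directory before trying to load from it:

```python
from pathlib import Path

def is_loadable_model_dir(path):
    """Hypothetical pre-flight check: from_pretrained() expects a local
    directory to contain config.json (the model config, not
    tokenizer_config.json), so verify that before loading."""
    return (Path(path) / "config.json").is_file()
```

Saving the model with model.save_pretrained() into the same directory as the tokenizer is what writes this config.json.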

Solution 3

Renaming the "tokenizer_config.json" file -- the one created by save_pretrained() -- to "config.json" solved the same issue in my environment.
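A hedged sketch of that rename (the helper name is hypothetical). As the question's follow-up error shows, the resulting config.json may also need a model_type key (e.g. "distilbert") before AutoTokenizer recognizes it, so this copies the file and injects that key rather than renaming blindly:

```python
import json
from pathlib import Path

def expose_tokenizer_config(tokenizer_dir, model_type="distilbert"):
    """Hypothetical helper: copy tokenizer_config.json to config.json,
    injecting a model_type key to avoid the 'Unrecognized model' error."""
    src = Path(tokenizer_dir) / "tokenizer_config.json"
    dst = Path(tokenizer_dir) / "config.json"
    config = json.loads(src.read_text())
    config.setdefault("model_type", model_type)  # only added if missing
    dst.write_text(json.dumps(config))
    return dst
```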

Author: sachinruk

PhD in Bayesian Machine Learning. Obsessed with DL. Currently dipping toes in Reinforcement Learning.

Updated on June 15, 2022

Comments

  • sachinruk
    sachinruk about 2 years

    I am trying to save the tokenizer in huggingface so that I can load it later from a container where I don't need access to the internet.

    from transformers import AutoTokenizer

    BASE_MODEL = "distilbert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.save_vocabulary("./models/tokenizer/")
    tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
    

    However, the last line is giving the error:

    OSError: Can't load config for './models/tokenizer3/'. Make sure that:
    
    - './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'
    
    - or './models/tokenizer3/' is the correct path to a directory containing a config.json file
    

    transformers version: 3.1.0

    The question "How to load the saved tokenizer from pretrained model in Pytorch" didn't help, unfortunately.

    Edit 1

    Thanks to @ashwin's answer below, I tried save_pretrained instead, and I got the following error:

    OSError: Can't load config for './models/tokenizer/'. Make sure that:
    
    - './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
    
    - or './models/tokenizer/' is the correct path to a directory containing a config.json file
    

    The contents of the tokenizer folder are below: [screenshot of the tokenizer directory]

    I tried renaming tokenizer_config.json to config.json and then I got the error:

    ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
    
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    Tried looking into it. It seems like a bug. And as you have figured out, it saves tokenizer_config.json but expects config.json.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    As a workaround, since you are not modifying the tokenizer: get the model using from_pretrained(), then save the model; you can then load the tokenizer from the saved model's directory. This should serve as a tentative workaround.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    Please check out the modification.
  • cronoik
    cronoik over 3 years
    @sachinruk: Just in case you have to work with the AutoTokenizers, you have to save the corresponding config as shown here.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    @cronoik, I checked your answer in the other post. However, I was curious to know if there is any raised issue on github? I could not find any issue concerning this problem.
  • cronoik
    cronoik over 3 years
    @AshwinGeetD'Sa Yes, there is. I have linked to it in the first sentence :) But the issue was closed. I will reopen it tomorrow and provide a patch.
  • Ashwin Geet D'Sa
    Ashwin Geet D'Sa over 3 years
    That's awesome :)
  • user5520049
    user5520049 almost 3 years
    Excuse me, will evaluating the captioning for the images be applied to testing, for example if I'm using COCO?
  • MAC
    MAC over 2 years
    tokenizer.save_pretrained("/home/pchhapolika/Bert_multilingual_exp_TCM/model_mlm_exp1") produces 4 files when I add new tokens. Ideally, when I save the tokenizer, shouldn't it produce only one tokenizer.json file?
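The workaround discussed in the comments above (save the model and the tokenizer into the same directory, then load offline) can be sketched as follows. The model name and output directory are taken from the question; transformers is imported lazily inside the function, and this is an untested sketch, not a verified fix:

```python
def save_model_and_tokenizer(base_model="distilbert-base-multilingual-cased",
                             out_dir="./models/tokenizer/"):
    """Sketch of the workaround: saving the model alongside the tokenizer
    writes the config.json that from_pretrained() looks for."""
    # Lazy import so the sketch can be defined without transformers installed.
    from transformers import AutoModel, AutoTokenizer
    model = AutoModel.from_pretrained(base_model)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model.save_pretrained(out_dir)      # writes config.json and the weights
    tokenizer.save_pretrained(out_dir)  # writes the tokenizer files alongside
    # Later, in the offline container:
    # tokenizer2 = AutoTokenizer.from_pretrained(out_dir)
```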