tokenizer is slow when adding new tokens

The tokenizer is slow when adding new tokens even with the Fast class:

from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

# Paths to the pretrained vocab.json / merges.txt files:
paths = dict()
paths["tokenizer"] = "whatever/is/the/path/to/pretrained/vocab.json/merges.txt"

# They have to be sorted in reverse by length, otherwise shorter tokens are matched first
newtokens = ["new_" + str(x) for x in range(20000)]
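For reference, the reverse-length sort mentioned in the comment above can be sketched in plain Python (the printed endpoints are just for illustration; `sorted` is stable, so ties keep their original order):

```python
# Sort added tokens longest-first so that e.g. "new_2000" would be
# matched before its prefix "new_200".
newtokens = ["new_" + str(x) for x in range(20000)]
newtokens = sorted(newtokens, key=len, reverse=True)

# Longest token string first, shortest last.
print(newtokens[0], newtokens[-1])  # new_10000 new_9
```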

# loading tokenizer from the saved model path
tokenizers = dict()
tokenizers["fast"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["fast_custom"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["slow_custom"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
tokenizers["slow"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])

# Special tokens used by the tokenizer:
# {
#   "eos_token": "</s>",
#   "bos_token": "<s>",
#   "unk_token": "<unk>",
#   "pad_token": "<pad>",
#   "mask_token": "<mask>"
# }

# Add new vocab to the custom tokenizers
for k in tokenizers:
    if "custom" in k:
        print("Vocab length before:", len(tokenizers[k].get_vocab()))
        tokenizers[k].add_tokens(newtokens)
        print("Vocab length after:", len(tokenizers[k].get_vocab()))

# creating the configuration from which the model can be made
config = GPT2Config()

# creating the model
model = TFGPT2LMHeadModel(config)

# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k, v in tokenizers.items():
    print(k, v.tokenize(text))

and then profiling the speed in Jupyter (the `%timeit` line magic, since the `%%timeit` cell magic cannot be used inside a loop):

for k in tokenizers:
    %timeit tokenizers[k].tokenize(text)
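Outside a notebook, the same per-call timing can be sketched with the stdlib `timeit` module; the `tokenize` function below is a stand-in placeholder, not the transformers API:

```python
import timeit

# Stand-in for a real tokenizer call (hypothetical placeholder):
def tokenize(text):
    return text.split()

text = "this is a sentence containing new_200"

# Average per-call time over 10000 runs.
elapsed = timeit.timeit(lambda: tokenize(text), number=10000)
print("per-call seconds:", elapsed / 10000)
```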

Any ideas why this may be happening? I understand that the vocab size could increase by ~20% and that may slow things down, but in this code there's a performance difference of roughly 1000-fold in speed. That doesn't seem right.
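One plausible (unconfirmed) explanation for a gap this large is that each added token is checked individually on every tokenize call, rather than through a single indexed lookup. A toy sketch of that difference, using a hypothetical word-level matcher rather than the real tokenizer internals:

```python
import time

# 20000 added tokens, as in the question above.
added = ["new_" + str(x) for x in range(20000)]
added_set = set(added)
words = "this is a sentence containing new_200".split()

def match_linear(words):
    # Compare every word against every added token: O(len(added)) per word.
    return [w for w in words if any(w == tok for tok in added)]

def match_set(words):
    # One hash lookup per word: O(1) per word.
    return [w for w in words if w in added_set]

# Both strategies find the same added token.
assert match_linear(words) == match_set(words) == ["new_200"]

# Crude timing: the linear scan is dramatically slower per call.
for fn in (match_linear, match_set):
    t0 = time.perf_counter()
    for _ in range(50):
        fn(words)
    print(fn.__name__, round(time.perf_counter() - t0, 4))
```

If the slow path behaves like `match_linear`, the cost grows with the number of added tokens rather than with the input length, which would fit the observed slowdown.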
