Hi,
Tokenization becomes slow after adding new tokens, even with the Fast tokenizer class:
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer
# The pretrained vocab/merges files can be found via, for example:
# https://huggingface.co/transformers/v3.1.0/_modules/transformers/tokenization_gpt2.html
paths = dict()
paths["tokenizer"] = "whatever/is/the/path/to/pretrained/vocab.json/merges.txt"
# They have to be sorted in reverse by length, otherwise the longer tokens aren't matched first
newtokens = range(0, 20000)
newtokens = list(newtokens)
newtokens.sort(reverse=True)
newtokens = ["new_" + str(x) for x in newtokens]
# Loading the tokenizers from the saved model path
tokenizers = dict()
tokenizers["fast"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["fast_custom"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["slow_custom"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
tokenizers["slow"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
# Add the special tokens to every tokenizer
for k in tokenizers:
    tokenizers[k].add_special_tokens({
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>"
    })
# Add new vocab
# https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
# https://github.com/deepset-ai/FARM/issues/157
for k in tokenizers:
    if "custom" in k:
        print(k)
        print("Vocab length before:", len(tokenizers[k].get_vocab()))
        tokenizers[k].add_tokens(newtokens)
        print("Vocab length after:", len(tokenizers[k].get_vocab()))
# Creating the configuration from which the model can be made
config = GPT2Config(
    vocab_size=len(tokenizers["fast_custom"]),
    bos_token_id=tokenizers["fast_custom"].bos_token_id,
    eos_token_id=tokenizers["fast_custom"].eos_token_id
)
# creating the model
# https://huggingface.co/transformers/_modules/transformers/configuration_gpt2.html
model = TFGPT2LMHeadModel(config)
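# Note: the model above is built from scratch with vocab_size=len(tokenizers["fast_custom"]),
# so its embedding matrix already covers the added tokens. For completeness, a minimal sketch
# of the pretrained-weights route (assumes the standard "gpt2" checkpoint, which the original
# report does not use): load the weights, then grow the embeddings to the new vocab size.
pretrained = TFGPT2LMHeadModel.from_pretrained("gpt2")
pretrained.resize_token_embeddings(len(tokenizers["fast_custom"]))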
# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k, v in tokenizers.items():
    print(k, v.tokenize(text))
and then profiling the speed in Jupyter:
for k in tokenizers:
    print(k)
    %timeit tokenizers[k].tokenize(text)
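To measure the same thing outside Jupyter, here is a minimal sketch using the standard-library timeit module; it reuses the tokenizers dict and text defined above:
import timeit

for k in tokenizers:
    # Average seconds per tokenize() call over 100 runs
    per_call = timeit.timeit(lambda: tokenizers[k].tokenize(text), number=100) / 100
    print(f"{k}: {per_call * 1e6:.1f} microseconds per call")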
Any ideas why this may be happening? I understand that the vocab size increases by ~20% and that this could slow things down, but in this code there is roughly a 1,000-fold difference in tokenization speed. That doesn't seem right?
Hi @davidnarganes,
Someone from HF can correct me if I am wrong, but you'll probably get a faster response by posting this issue in the Tokenizers repo:
https://github.com/huggingface/tokenizers
Best of luck