AutoTokenizer from pretrained BERT throws TypeError when encoding certain input

Environment info

  • transformers version: 4.3.2
  • Platform: Arch Linux
  • Python version: 3.9.1
  • PyTorch version (GPU?): 1.7.1, no
  • Tensorflow version (GPU?): Not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Guess from git blame: @LysandreJik, @thomwolf, @n1t0

Information

Model I am using (Bert, XLNet …): BERT

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

When I use a pretrained BERT tokenizer, encoding throws a TypeError on singleton list input or on list input containing ø/æ/å.

I discovered this with the pretrained Maltehb/danish-bert-botxo, which fails in the way shown below on any input containing the Danish characters ø/æ/å, but I then realized that the same error also occurs with the standard bert-base-uncased, as shown below.

Steps to reproduce the behavior:

  1. Run these lines:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode(["hello", "world"])                          # <--- This works
tokenizer.encode(["hello"])                                   # <--- This throws the below shown stack trace
tokenizer.encode(["dette", "er", "en", "sø"])                 # <--- This throws the same error

Stack trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-ef056deb5f59> in <module>
----> 1 tokenizer.encode(["hello"])

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
   2102                 ``convert_tokens_to_ids`` method).
   2103         """
-> 2104         encoded_inputs = self.encode_plus(
   2105             text,
   2106             text_pair=text_pair,

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2418         )
   2419 
-> 2420         return self._encode_plus(
   2421             text=text,
   2422             text_pair=text_pair,

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    453 
    454         batched_input = [(text, text_pair)] if text_pair else [text]
--> 455         batched_output = self._batch_encode_plus(
    456             batched_input,
    457             is_split_into_words=is_split_into_words,

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    380         )
    381 
--> 382         encodings = self._tokenizer.encode_batch(
    383             batch_text_or_text_pairs,
    384             add_special_tokens=add_special_tokens,

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Expected behavior

I expect the tokenizer not to throw a TypeError on some inputs when all of them have the same type (a list of strings).
I also expected the tokenization to produce token IDs.

I am grateful for the software and thank you in advance for the help!

1 possible answer(s) on “AutoTokenizer from pretrained BERT throws TypeError when encoding certain input”

  1. I believe the encode method never accepted batches as input. We introduced encode_plus and batch_encode_plus down the road, the latter being the first to handle batching. (See the sketch after this list for an illustration of why your two-element list only appears to work.)

    While these two methods are deprecated, they’re still tested and working, and they’re used under the hood when calling __call__.

    What is happening here is that v3.5.1 treats your input as individual words (though by all means it shouldn’t, since the is_split_into_words argument is False by default) rather than as separate batch entries, so I was mistaken in my first analysis. Something did change between v3.5.1 and v4.0.0, and all the breaking changes are documented in the migration guide.

    If you want to get back to the previous behavior, you have two ways of handling it:

    • Specify that you don’t want a fast tokenizer. The main change affecting you here is that AutoTokenizer now returns a fast (Rust-backed) tokenizer by default rather than the Python-based one. You can change that behavior with the following:
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
    • The behavior you’re relying on here is the is_split_into_words parameter: you’re passing a list of words rather than a single sequence (a string). That this worked in previous versions seems like a bug to me; here’s how you would handle it now (works with a fast tokenizer; see also the note after this list if you want a flat list of ids back):
    tokenizer(["hello", "world"], is_split_into_words=True)
    tokenizer(["hello"], is_split_into_words=True)
    tokenizer(["dette", "er", "en", "sø"], is_split_into_words=True)