TFBartForConditionalGeneration with labels padded with -100 gives NaN loss.

I am pretraining T5 and Bart.
I noticed that the padding tokens in the labels of these models should be replaced with -100, the ignore index, while the decoder_input_ids keep the regular padding token.

I changed the padding token in the labels for T5 (PyTorch, TensorFlow) and Bart (PyTorch), and it works well.
But Bart (TensorFlow) gives NaN loss.

Because of this, I also get an error message during pretraining:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of -100 which is outside the valid range of [0, 50265). Label values: 0 2387 2335 16 11962 2 -100 -100 -100 -100 -100 ...........
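For context, TensorFlow's sparse cross-entropy has no ignore index: a label outside [0, vocab_size) raises InvalidArgumentError on CPU and silently yields NaN on GPU, which matches both symptoms above. A minimal sketch of this behavior (assuming the loss is computed with tf.keras.losses.SparseCategoricalCrossentropy, as I believe it is in the TF models):

import tensorflow as tf

logits = tf.random.normal((4, 50265))        # fake per-token logits over the Bart vocab
labels = tf.constant([0, 2387, -100, -100])  # -100 marks positions meant to be ignored

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)
# -100 is outside the valid label range [0, 50265), so this raises
# InvalidArgumentError on CPU and produces NaN on GPU:
print(loss_fn(labels, logits))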

Environment info

  • transformers version: 4.2.2
  • Platform: ubuntu 18.04
  • Python version: 3.6
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?): 2.4.0
  • Using GPU in script?: yes (colab)
  • Using distributed or parallel set-up in script?: no

Bart: @patrickvonplaten

Information

Model I am using (Bert, XLNet …): TFBartForConditionalGeneration

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on are:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

import tensorflow as tf
from transformers import BartTokenizer, TFBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("My dog is <mask>", return_tensors='tf', truncation=True, max_length=16, padding="max_length")
labels_ids = tokenizer("My dog is cute", return_tensors='tf', truncation=True, max_length=16, padding="max_length").input_ids

## labels padded with pad_token_id (= 1)
loss = model(inputs, labels=labels_ids)[0]
print(labels_ids)
print(loss)

## labels padded with -100 (ignore index)
labels_ids = tf.where(
    labels_ids == 1,  # replace pad_token_id (1) with the ignore index -100
    tf.fill(tf.shape(labels_ids), tf.constant(-100, dtype='int32')),
    labels_ids,
)

loss = model(inputs, labels=labels_ids)[0]
print(labels_ids)
print(loss)

Results:

tf.Tensor(
[[    0  2387  2335    16 11962     2     1     1     1     1     1     1
      1     1     1     1]], shape=(1, 16), dtype=int32)
tf.Tensor(
[2.2291888e-05 4.8874615e-05 3.7073401e-05 7.9230859e-04 6.1941872e+00
 1.1058841e+00], shape=(6,), dtype=float32)
tf.Tensor(
[[    0  2387  2335    16 11962     2  -100  -100  -100  -100  -100  -100
   -100  -100  -100  -100]], shape=(1, 16), dtype=int32)
tf.Tensor(
[2.2291888e-05 4.8755410e-05 3.7073401e-05 7.9242775e-04 6.1941872e+00
 1.1058841e+00           nan           nan           nan           nan
           nan           nan           nan           nan           nan
           nan], shape=(16,), dtype=float32)
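Until this is fixed, a possible workaround (my assumption, continuing the snippet above) is to keep the regular pad_token_id in the labels for TFBart, since its loss already masks that token:

## workaround: map the ignore index back to pad_token_id (= 1) before the forward pass
labels_ids = tf.where(
    labels_ids == -100, tf.fill(tf.shape(labels_ids), tf.constant(1, dtype='int32')), labels_ids
)
loss = model(inputs, labels=labels_ids)[0]
print(loss)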

1 possible answer(s):

  1. Great catch @kiyoungkim1!

    It’s not very consistent what we are doing here… TFBart should never have ignored the pad_token_id by default; it should ignore -100, as all other models do.

    To fix the problem, I think we should add a couple of lines that check whether -100 appears in the labels and, if so, replace those values with the pad_token_id, for consistency with PyTorch’s Bart. Just replacing pad_token_id with -100 would be a pretty big breaking change, so I think the first option is the better one. @kiyoungkim1, if you feel like opening a PR to correct this behavior, we would be more than happy 🙂
    To fix the problem, I think we should add a couple of lines that check if -100 are in the labels and if yes replaces them with the pad_token_id to have consistency with PyTorch’s Bart. It would be a pretty big breaking change to just replace pad_token_id with -100 so I think the first option is the better one. @kiyoungkim1 if you feel like opening a PR to correct this behavior we would be more than happy 🙂