Model not training beyond 1st epoch

Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.0+cu101 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No (Single GPU) –> COLAB

Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

First off, this issue is essentially a continuation of #10055, but since that error was mostly resolved, I have opened a separate issue. I am using a private dataset, so I am not at liberty to share it. However, I can give an idea of what the CSV looks like:


,ID,Text,Label
......................
Id_1, "Lorem Ipsum", 14
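
Since the real data cannot be shared, here is a minimal sketch of how a file in this layout could be read and its labels checked before training. The rows in `SAMPLE_CSV` are made-up placeholders, and the helper names `load_texts_and_labels` and `build_label_mapping` are hypothetical, not part of any library. The point is only to count the distinct labels and map them to contiguous ids starting at 0, which is what an integer-label classification loss generally expects.

```python
import csv
import io

# Made-up placeholder rows in the layout described above (NOT the real data).
SAMPLE_CSV = """,ID,Text,Label
0,Id_1,"Lorem Ipsum",14
1,Id_2,"Dolor sit amet",3
2,Id_3,"Consectetur",14
3,Id_4,"Adipiscing elit",7
"""


def load_texts_and_labels(csv_file):
    """Read the Text and Label columns from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    texts, raw_labels = [], []
    for row in reader:
        texts.append(row["Text"])
        raw_labels.append(int(row["Label"]))
    return texts, raw_labels


def build_label_mapping(raw_labels):
    """Map each distinct raw label to a contiguous id in 0..N-1."""
    return {label: idx for idx, label in enumerate(sorted(set(raw_labels)))}


texts, raw_labels = load_texts_and_labels(io.StringIO(SAMPLE_CSV))
label_to_id = build_label_mapping(raw_labels)
label_ids = [label_to_id[label] for label in raw_labels]
num_labels = len(label_to_id)  # number of output units the classifier would need
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, and `num_labels` would be the value to check against the model configuration.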

This is the code:


!git clone https://github.com/huggingface/transformers.git
# %cd persists for later commands in Colab; !cd only affects its own subshell
%cd transformers
!pip install -e .

train_text = list(train['Text'].values)
train_label = list(train['Label'].values)

val_text = list(val['Text'].values)
val_label = list(val['Label'].values)

from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base')

train_encodings = tokenizer(train_text, truncation=True, padding=True)
val_encodings = tokenizer(val_text, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_label
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_label
))

#----------------------------------------------------------------------------------------------------------------------
#Since The trainer does not work, I will use the native one
from transformers import TFTrainingArguments, TFTrainer

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
    model = TFRobertaForSequenceClassification.from_pretrained("roberta-base")

trainer = TFTrainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()
#----------------------------------------------------------------------------------------------------------------------
#Using Native Tensorflow 

from transformers import TFRobertaForSequenceClassification
import tensorflow as tf

model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=1)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-18)

loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy']) # can also use any keras loss fn
model.fit(train_dataset.batch(8), validation_data = val_dataset.batch(64), epochs=15, batch_size=8)

The Problems:

  • I cannot train using the TFTrainer. The cell executes successfully but does nothing: training never starts. This is not a major issue for me, but it may be related to the second problem.
  • The model does not train beyond the first epoch. In the log below, every later epoch reports exactly the same loss and accuracy as the first one:

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1/5
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f5b14f1b6c8>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f5b323fb2a0> is not a module, class, method, function, traceback, frame, or code object
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7f5b14f1b6c8>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7f5b323fb2a0> is not a module, class, method, function, traceback, frame, or code object
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

WARNING:tensorflow:AutoGraph could not transform <function wrap at 0x7f5b301d3c80> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function wrap at 0x7f5b301d3c80> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
180/180 [==============================] - ETA: 0s - loss: 0.0000e+00 - accuracy: 0.0022
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
180/180 [==============================] - 150s 589ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
Epoch 2/5
180/180 [==============================] - 105s 582ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077
Epoch 3/5
180/180 [==============================] - 105s 582ms/step - loss: 0.0000e+00 - accuracy: 0.0022 - val_loss: 0.0000e+00 - val_accuracy: 0.0077

I think the problem may be a wrong activation function. My understanding is that CategoricalCrossentropy needs a sigmoid activation, and the activation used in my code may be something else.

Can anyone tell me exactly how to change the activation function, or share other thoughts on what the problem might be? I have tried changing the learning rate, with no effect.
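
One possible reading of the log, offered as a sketch rather than a confirmed diagnosis: with `num_labels=1` the model emits a single logit per example, and a softmax over one value is always 1.0, so categorical cross-entropy comes out as exactly zero whatever the label is. A zero loss gives zero gradients, which would match epochs that keep repeating the same numbers. The pure-Python functions below are illustrative re-implementations of the formulas, not the Keras code, and the 15-class setup is an assumption based only on the sample label 14 in the CSV excerpt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, logits):
    """Cross-entropy where y_true is a per-class target vector (one-hot style)."""
    probs = softmax(logits)
    return sum(-t * math.log(p) for t, p in zip(y_true, probs))


def sparse_categorical_crossentropy(label, logits):
    """Cross-entropy where label is a single integer class id."""
    return -math.log(softmax(logits)[label])


# Case 1: one output unit (num_labels=1) and the raw label 14 used as the target.
# softmax([x]) == [1.0] for any x, so the loss is 14 * -log(1.0) == 0.
single_logit_loss = categorical_crossentropy([14.0], [0.37])

# Case 2: ASSUMED 15 classes (ids 0..14) with an untrained, uniform output.
# The loss is log(15), a non-zero value that gradients can reduce.
NUM_CLASSES = 15  # assumption, not taken from the real dataset
multi_class_loss = sparse_categorical_crossentropy(14, [0.0] * NUM_CLASSES)
```

If this reading applies, the Keras-side change would be to load the model with `num_labels` equal to the real number of classes and to use `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` for integer labels. Separately, a learning rate of `1e-18` is far too small to move the weights in any visible way; values around `2e-5` to `5e-5` are more usual for fine-tuning.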

2 thoughts on “Model not training beyond 1st epoch”

  1. Could you please post this on the forum, rather than here? The authors of HuggingFace like to keep this place for bugs or feature requests, and they’re more than happy to help you on the forum.

    Looking at your code, this seems more like an issue with preparing the data correctly for the model.

    Take a look at this example in the docs on how to perform text classification with the Trainer.

  2. Not very pleased with your reply. Please ask a question if you are unclear about something, rather than just trying to close an issue.

    I want to jump in here and let you know that this kind of behavior is inappropriate. @NielsRogge is doing his best to help you here and he is doing this in his own free time. “My model is not training” is very vague and doesn’t seem like a bug, so suggesting to take this to the forums is very appropriate: more people will be able to help you there.

    Please respect that this is an open-source project. No one has to help you solve your bug, so staying open-minded and kind will go a long way toward getting the help you need.