0% GPU usage when using `hyperparameter_search`

## Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.0+cu101 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No (single GPU) -> Colab

Information

Model I am using (Bert, XLNet …): RoBERTa

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

This is a continuation of #10055, where the underlying code is the same and is more or less the same as the official example. The problem is that when I start `hyperparameter_search`, it just keeps running with 0% GPU usage (memory is occupied) and the CPU also remains relatively idle:


```
== Status ==
Memory usage on this node: 5.9/25.5 GiB
PopulationBasedTraining: 0 checkpoints, 0 perturbs
Resources requested: 1/4 CPUs, 1/1 GPUs, 0.0/14.99 GiB heap, 0.0/5.18 GiB objects (0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/_inner_2021-02-15_11-45-33
Number of trials: 1/100 (1 RUNNING)
+--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------+
| Trial name         | status   | loc   | adafactor   |   adam_beta1 |   adam_beta2 |   adam_epsilon |   learning_rate |   max_grad_norm |   num_train_epochs |   per_device_train_batch_size |    seed |   weight_decay |
|--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------|
| _inner_4fd43_00000 | RUNNING  |       | True        |     0.862131 |     0.813033 |          1e-09 |     2.34754e-05 |       0.0056821 |                  2 |                            16 | 21.1968 |        0.95152 |
+--------------------+----------+-------+-------------+--------------+--------------+----------------+-----------------+-----------------+--------------------+-------------------------------+---------+----------------+
```

Sometimes there are also warnings that the single worker is pending due to lack of resources, even though my CPU usage is minimal, plenty of RAM is free (~24 GB), and the GPU also has about a gig of free memory:


```
2021-02-15 13:56:53,761	WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff44ed5e1383be630817647ecd01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {3.000000/4.000000 CPU, 14.990234 GiB/14.990234 GiB memory, 0.000000/1.000000 GPU, 1.000000/1.000000 node:172.28.0.2, 5.126953 GiB/5.126953 GiB object_store_memory, 1.000000/1.000000 accelerator_type:V100}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
```
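
Since `Trainer.hyperparameter_search` forwards extra keyword arguments to `ray.tune.run` when `backend='ray'`, one thing worth checking is whether each trial is actually reserving the GPU. Below is a hedged sketch only, reusing the `trainer` and `pbt` objects from the snippet further down; the `resources_per_trial` values are an assumption for a single-GPU Colab node, not something confirmed in this issue:

```python
# Sketch: explicitly pin each Ray Tune trial to 1 CPU + 1 GPU so the RUNNING
# trial owns the GPU instead of relying on the default placement.
# Extra kwargs to `hyperparameter_search` are passed through to `ray.tune.run`.
best_run = trainer.hyperparameter_search(
    n_trials=100,
    direction="maximize",
    backend="ray",
    scheduler=pbt,
    resources_per_trial={"cpu": 1, "gpu": 1},  # assumption: single-GPU Colab node
)
```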

This is what the tuner looks like:


```python
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray.tune.schedulers import PopulationBasedTraining
from ray import tune
import random

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="accuracy",
    mode="max",
    perturbation_interval=10,  # every 10 `time_attr` units
                               # (training_iterations in this case)
    hyperparam_mutations={
        "weight_decay": tune.uniform(1, 0.0001),
        "seed": tune.uniform(1, 20000),
        "learning_rate": tune.choice([1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5, 2e-7, 1e-7, 3e-7, 2e-8]),
        "adafactor": tune.choice(['True', 'False']),
        "adam_beta1": tune.uniform(1.0, 0.0),
        "adam_beta2": tune.uniform(1.0, 0),
        "adam_epsilon": tune.choice([1e-8, 2e-8, 3e-8, 1e-9, 2e-9, 3e-10]),
        "max_grad_norm": tune.uniform(1.0, 0),
    })

best_run = trainer.hyperparameter_search(n_trials=100, compute_objective='accuracy', direction="maximize", backend='ray',
                                         scheduler=pbt)
```
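
For context, `hyperparameter_search` requires the `Trainer` to be built with a `model_init` callable so every trial can rebuild the model from scratch. The actual setup lives in #10055 and is not reproduced here; the sketch below only illustrates that requirement, and the checkpoint name, datasets, and `compute_metrics` are placeholders rather than the ones from this issue:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # A fresh model is created for every Ray Tune trial.
    # "roberta-base" is a placeholder; the issue uses a RoBERTa checkpoint.
    return AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # evaluate every epoch so the scheduler sees eval metrics
    per_device_train_batch_size=16,
)

trainer = Trainer(
    args=training_args,
    model_init=model_init,            # required by hyperparameter_search
    train_dataset=train_dataset,      # placeholder: tokenized datasets assumed to exist
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # placeholder: must emit the metric being optimized
)
```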

Using `HyperOptSearch` instead causes OOMs.
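
For reference, `HyperOptSearch` is a search algorithm rather than a scheduler, so it is wired in through `search_alg` (forwarded to `ray.tune.run`), not through `scheduler`. This is only a sketch of that wiring, not a fix for the OOM; the metric key and the concurrency cap are assumptions:

```python
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch

# Assumption: "objective" is the value reported back to Ray Tune by the Trainer integration.
hyperopt = HyperOptSearch(metric="objective", mode="max")

best_run = trainer.hyperparameter_search(
    n_trials=100,
    direction="maximize",
    backend="ray",
    # Cap concurrency to one trial at a time on a single-GPU node (assumption).
    search_alg=ConcurrencyLimiter(hyperopt, max_concurrent=1),
)
```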

1 possible answer(s) on “0% GPU usage when using `hyperparameter_search`”

  1. So I tried that above, but apparently `evaluate` does not return “accuracy”, so as a workaround I switched to `eval_accuracy`.
    But this creates a new problem: the error below comes up in the first trial, and it doesn’t move on to the next trial. Could it be that it is still training? GPU usage seems to be 0, so I doubt it is training, but it is not terminating the process or moving on either. Strange.

    
    ```
    2021-02-17 10:57:12,244	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::ImplicitFunc.train_buffered() (pid=1340, ip=172.28.0.2)
      File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python/ray/_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 167, in train_buffered
        result = self.train()
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 226, in train
        result = self.step()
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 366, in step
        self._report_thread_runner_error(block=True)
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
        ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
    ray.tune.error.TuneError: Trial raised an exception. Traceback:
    ray::ImplicitFunc.train_buffered() (pid=1340, ip=172.28.0.2)
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 248, in run
        self._entrypoint()
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 316, in entrypoint
        self._status_reporter.get_checkpoint())
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 576, in _trainable_func
        output = fn()
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 651, in _inner
        inner(config, checkpoint_dir=None)
      File "/usr/local/lib/python3.6/dist-packages/ray/tune/function_runner.py", line 645, in inner
        fn(config, **fn_kwargs)
      File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 160, in _objective
        local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
      File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 925, in train
        for step, inputs in enumerate(epoch_iterator):
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 475, in _next_data
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "<ipython-input-12-cd510628f360>", line 10, in __getitem__
    TypeError: new(): invalid data type 'str'
    ```
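
    As an aside, `new(): invalid data type 'str'` generally means a raw Python string reached a tensor constructor inside the dataset's `__getitem__`. The dataset class is not shown in this issue, so the following is only a hypothetical sketch of a `__getitem__` that hands PyTorch numeric tensors instead of strings:

    ```python
    import torch
    from torch.utils.data import Dataset

    class EncodedDataset(Dataset):
        """Hypothetical dataset: `encodings` come from a tokenizer, `labels` are numeric."""

        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            # Only numeric values are turned into tensors here; if a raw string
            # (e.g. an unencoded label) reaches the tensor constructor, PyTorch
            # rejects it with an error like the one above.
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item["labels"] = torch.tensor(int(self.labels[idx]))
            return item

        def __len__(self):
            return len(self.labels)
    ```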
    

    It looks like it is pointing to `objective`, which is the same function you wrote above:

    ```python
    def compute_objective(metrics):
        return metrics["eval_accuracy"]  # the metrics dict has "eval_accuracy", not "accuracy"
    ```
    

    Interestingly, removing the `compute_objective` and `direction` args did not change anything, so I figured the problem must be elsewhere.

    Putting `eval_accuracy` in the PBT parameters and making `compute_objective` return it solves the issue (see the sketch below).
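
    For reference, a minimal sketch of the wiring that ended up working: the scheduler watches `eval_accuracy` and `compute_objective` returns the same key. The mutation values below are illustrative only, not the exact search space from this thread:

    ```python
    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining

    def compute_objective(metrics):
        # Trainer.evaluate prefixes its metrics with "eval_", so "accuracy"
        # shows up as "eval_accuracy".
        return metrics["eval_accuracy"]

    pbt = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="eval_accuracy",  # match the key actually reported during evaluation
        mode="max",
        perturbation_interval=10,
        hyperparam_mutations={
            "learning_rate": tune.choice([1e-5, 2e-5, 3e-5, 5e-5]),  # illustrative values
            "weight_decay": tune.uniform(0.0, 0.3),
        },
    )

    best_run = trainer.hyperparameter_search(
        n_trials=100,
        direction="maximize",
        backend="ray",
        scheduler=pbt,
        compute_objective=compute_objective,  # a callable, not a string
    )
    ```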

    Thanks a lot @amogkam for your support!! We need more people like you 👍 🚀 🥳