1.3GB dataset creates over 107GB of cache file!

Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Google Colab
  • Python version: 3.6
  • PyTorch version (GPU?): 1.7
  • Tensorflow version (GPU?): None
  • Using GPU in script?: No; a Colab TPU is used
  • Using distributed or parallel set-up in script?: Using default run_mlm.py script

Who can help



Model I am using (Bert, XLNet …): DistilBert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

!python /content/transformers/examples/xla_spawn.py --num_cores 8 \
  /content/transformers/examples/language-modeling/run_mlm.py \
  --model_type distilbert \
  --config_name /content/TokenizerFiles \
  --tokenizer_name /content/drive/TokenizerFiles \
  --train_file Corpus.txt \
  --mlm_probability 0.15 \
  --output_dir "/content/TrainingCheckpoints" \
  --do_train \
  --per_device_train_batch_size 32 \
  --save_steps 500 --disable_tqdm False \
  --line_by_line True \
  --max_seq_length 128 \
  --pad_to_max_length True \
  --cache_dir /content/cache_dir --save_total_limit 2

The script ends up creating more than 107GB of cache files with only 54% of the preprocessing done, which crashes the Colab environment.
This means that 200+ GB of disk space is required to cache and preprocess a mere 1.3GB file. Am I doing something wrong here? I ran the same script a few days ago and it didn't give me any such "out of disk space" error. Because I wanted to use the TPU, I changed pad_to_max_length=True (10192). That's all I changed, and now it does this. Let me know if anyone requires more information to help me out with this.
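A rough back-of-envelope calculation (all counts below are assumptions for illustration, not measured values) shows why padding every line to max_seq_length inflates the cache so much: the on-disk cache stores the padded token ids as fixed-width integers for every line, regardless of how short the original line was.

```python
# Hypothetical estimate of cache growth when every line is padded to
# max_seq_length and stored as 64-bit integers. The line count and column
# count are assumptions, not taken from the actual corpus.
max_seq_length = 128
bytes_per_token = 8            # token ids stored as int64
columns = 3                    # e.g. input_ids, attention_mask, special_tokens_mask (assumed)
num_lines = 25_000_000         # assumed line count for a ~1.3GB line-by-line corpus

cache_bytes = num_lines * max_seq_length * bytes_per_token * columns
print(f"~{cache_bytes / 1e9:.0f} GB")
```

Even with these conservative assumptions the padded cache lands in the tens of gigabytes before any serialization overhead, so a 100x+ blow-up relative to the raw text is plausible rather than surprising.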

Expected behavior

The dataset should cache in a minimal amount of disk space. It currently occupies over 150-200x the size of the actual dataset.

One possible answer

  1. Trainer on master fully supports set_transform. If some columns are removed that should not be, you just have to set the remove_unused_columns training argument to False for the time being.
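The point of set_transform is that the tokenization runs lazily at access time instead of being materialised to a cache file on disk. A pure-Python sketch of that behaviour (the class and names here are illustrative stand-ins, not the actual datasets implementation):

```python
# Illustration of the lazy-transform idea behind datasets' set_transform:
# the transform runs when a row is accessed, so nothing tokenized is ever
# written to disk. LazyDataset is a hypothetical stand-in class.
class LazyDataset:
    def __init__(self, rows):
        self.rows = rows
        self.transform = None

    def set_transform(self, fn):
        # Register the transform; do NOT apply or cache it now.
        self.transform = fn

    def __getitem__(self, i):
        row = self.rows[i]
        return self.transform(row) if self.transform else row


ds = LazyDataset([{"text": "hello world"}, {"text": "foo"}])
# Stand-in for tokenizer(..., padding="max_length"): one id per character.
ds.set_transform(lambda row: {"input_ids": list(range(len(row["text"])))})
print(len(ds[0]["input_ids"]))  # 11
```

With the real library the same pattern applies: call set_transform with the tokenization function and pass remove_unused_columns=False in the TrainingArguments so the Trainer does not drop the raw text column the transform needs.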