🤗 Huggingface/Trainer GPU vs TPU Benchmark

Created: 3/7/2021, 2:32:21 PM
A GPU vs TPU performance comparison based on Hugging Face's PyTorch training code for a RoBERTa model (Trainer).
As in the example, training uses the small wikitext sample dataset (4,798 sentences) and runs for 3 epochs.
Model config
RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
JSON
Training config
Training/evaluation parameters TrainingArguments(
    output_dir=../test-result, overwrite_output_dir=False,
    do_train=True, do_eval=True, do_predict=False,
    evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False,
    per_device_train_batch_size=2, per_device_eval_batch_size=8,
    gradient_accumulation_steps=1, eval_accumulation_steps=None,
    learning_rate=5e-05, weight_decay=0.0,
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0,
    num_train_epochs=3.0, max_steps=-1,
    lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0,
    logging_dir=runs/Feb20_17-01-13_plm-benchmark-pytorch-tpu,
    logging_first_step=False, logging_steps=500,
    save_steps=500, save_total_limit=None,
    no_cuda=False, seed=42,
    fp16=False, fp16_opt_level=O1, fp16_backend=auto,
    local_rank=-1, tpu_num_cores=8, tpu_metrics_debug=False, debug=False,
    dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0,
    past_index=-1, run_name=../test-result, disable_tqdm=False,
    remove_unused_columns=True, label_names=None,
    load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None,
    ignore_data_skip=False, sharded_ddp=False, deepspeed=None,
    label_smoothing_factor=0.0, adafactor=False, group_by_length=False,
    report_to=['tensorboard'], ddp_find_unused_parameters=None,
    dataloader_pin_memory=True, _n_gpu=0)
Plain Text
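For reference, a minimal sketch of how such a run is typically launched with the transformers v4.3 language-modeling example. The script names (run_mlm.py, xla_spawn.py), the roberta-base tokenizer, and the wikitext_sample.txt path are assumptions for illustration, not the exact commands used in this benchmark.
# Sketch only, assuming the standard transformers v4.3 example scripts.
# 4x V100, single node (Trainer uses all visible GPUs)
python run_mlm.py \
    --model_type roberta \
    --tokenizer_name roberta-base \
    --train_file wikitext_sample.txt \
    --line_by_line \
    --do_train --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --output_dir ../test-result

# TPU v3-8: the same script launched on all 8 cores via the XLA spawner
python xla_spawn.py --num_cores 8 run_mlm.py \
    --model_type roberta \
    --tokenizer_name roberta-base \
    --train_file wikitext_sample.txt \
    --line_by_line \
    --do_train --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --output_dir ../test-result
Shell
The batch sizes in the table below look like global batch sizes (per_device_train_batch_size x number of devices): 4,798 sentences x 3 epochs ≈ 14,394 samples, and 14,394 / 16 ≈ 900 train steps, which matches the first rows.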
For the PyTorch + TPU setup, refer to the following document (using TFRC TPUs).
* Creating the CPU instance + TPU with the gcloud CLI is much more convenient and efficient than doing it from the GCP console (see the sketch below).
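As a rough sketch, creating the host VM and the TPU node from the CLI looks like the following; the zone, machine type, image, software version, and resource names are placeholders (assumptions), not the exact values used here.
# Sketch only; zone, machine type, image, and version values are placeholders.
# CPU host VM with a PyTorch/XLA image
gcloud compute instances create plm-host-vm \
    --zone=europe-west4-a \
    --machine-type=n1-standard-16 \
    --image-family=torch-xla \
    --image-project=ml-images \
    --boot-disk-size=200GB

# TPU v3-8 node in the same zone (TFRC quota)
gcloud compute tpus create plm-tpu-v3-8 \
    --zone=europe-west4-a \
    --accelerator-type=v3-8 \
    --version=pytorch-1.7 \
    --network=default
Shell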
Benchmark
| Name | batch-size | train steps | train-time (sec) | samples/sec | x faster |
| --- | --- | --- | --- | --- | --- |
| huggingface-pytorch-v100x4 | 16 | 900 | 336.2196 | | |
| huggingface-pytorch-v100x4 | 32 | 450 | 259.6039 | | |
| huggingface-pytorch-v100x4 | 64 | 225 | 245.2095 | | |
| huggingface-pytorch-tpuv3x8 | 16 | 900 | 237.898 | | |
| huggingface-pytorch-tpuv3x8 | 32 | 450 | 128.7528 | | |
| huggingface-pytorch-tpuv3x8 | 64 | 225 | 90.8725 | | |
| huggingface-pytorch-tpuv3x8 | 128 | 114 | 66.8629 | | |
| huggingface-tf-v100x1 | 16 | 900 | 703 | | |
| huggingface-tf-v100x4 | 32 | 450 | 213.365 | | |
| huggingface-tf-v100x4 | 64 | 225 | 203 | | |
| huggingface-tf-tpuv3x8 (N/A) | | | | | |
* For both GPU and TPU, the batch size was increased until OOM occurred.
* On TPU, a warmup of roughly 150 seconds sometimes occurs during the first 1-2 training steps. Re-running the same script immediately afterwards starts without the warmup. The benchmark above reports times for runs in which this warmup did not occur.
* max_seq_len = 1024
* A custom language-modeling TFTrainer was implemented by hand (https://gist.github.com/codertimo/55e608b1173aac22989aef2ff58faafe)
* Dynamic masking is used, since the model is RoBERTa.