This repository was archived by the owner on Jul 7, 2023. It is now read-only.
Adam is slower than Adafactor #1920
Open
Description
Hi,
I found that training Transformers with Adam is three times slower than with Adafactor.
Here is the command I am using for Adam:
t2t-trainer \
--data_dir=./t2t/t2t_data \
--problem=translate_ende_wmt32k \
--model=transformer \
--hparams_set=transformer_base \
--hparams="batch_size=1024,learning_rate_schedule=constant*linear_warmup*rsqrt_decay, learning_rate_constant=0.1,optimizer_adam_beta2=0.999" \
--schedule=continuous_train_and_eval \
--output_dir=./t2t/t2t_train/translate_ende_wmt32k_adam_lineB \
--train_steps=300000 \
--worker_gpu=10 \
--eval_steps=5000
Here is the command I am using for Adafactor:
t2t-trainer \
--data_dir=./t2t/t2t_data \
--problem=translate_ende_wmt32k \
--model=transformer \
--hparams_set=transformer_base \
--hparams="optimizer_adafactor_factored=False,batch_size=1024,optimizer=Adafactor,learning_rate_schedule=constant*linear_warmup*rsqrt_decay, learning_rate_constant=0.1,optimizer_adafactor_multiply_by_parameter_scale=False" \
--schedule=continuous_train_and_eval \
--output_dir=./t2t/t2t_train/translate_ende_wmt32k_adafactor_lineN \
--train_steps=300000 \
--worker_gpu=10 \
--eval_steps=5000
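For reference, here is my (simplified, plain-NumPy) understanding of what each optimizer does per parameter per step. This is only a sketch based on the Adam and Adafactor papers, not the tensor2tensor implementations. Since I set optimizer_adafactor_factored=False, I expected Adafactor to keep a full-size second-moment accumulator much like Adam:

import numpy as np

def adam_step(x, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps two full-size accumulators (m and v) per parameter.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

def adafactor_step(x, g, v, t, lr=0.1, clip=1.0, eps1=1e-30):
    # Unfactored Adafactor with beta1=0 and no parameter-scale multiplier
    # (my simplified reading of the paper): one full-size accumulator,
    # then an update clipped by its own RMS.
    beta2_t = 1.0 - t ** -0.8                      # decaying beta2, t >= 1
    v = beta2_t * v + (1.0 - beta2_t) * (g * g + eps1)
    u = g / np.sqrt(v)
    u = u / max(1.0, np.sqrt(np.mean(u * u)) / clip)
    x = x - lr * u
    return x, v

# Toy usage on one 512x512 weight matrix.
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 512))
g = rng.standard_normal((512, 512))
x1, m1, v1 = adam_step(x, g, np.zeros_like(x), np.zeros_like(x), t=1)
x2, v2 = adafactor_step(x, g, np.zeros_like(x), t=1)

If my reading is right, the two updates do a comparable amount of elementwise work when factoring is disabled, so I expected similar step times.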
However, I found that training for 100 steps takes about 240 seconds with Adam but only about 80 seconds with Adafactor (roughly 2.4 s/step vs. 0.8 s/step).
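If it helps narrow things down, would a small standalone timing along these lines be a reasonable way to separate the optimizer cost from the input pipeline and eval? This is only a sketch: I am writing the tf.train.AdamOptimizer and tensor2tensor.utils.adafactor.AdafactorOptimizer calls from memory, so the exact module path and constructor arguments may be off.

import time
import tensorflow as tf
from tensor2tensor.utils import adafactor  # assumed module path

def time_optimizer(make_opt, steps=100, dim=4096):
    # Time `steps` optimizer steps on a toy matmul loss (sketch only).
    tf.reset_default_graph()
    x = tf.get_variable("x", [dim, dim], initializer=tf.random_normal_initializer())
    w = tf.get_variable("w", [dim, dim], initializer=tf.random_normal_initializer())
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    opt = make_opt()
    train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)  # warm-up step
        start = time.time()
        for _ in range(steps):
            sess.run(train_op)
        return time.time() - start

print("Adam:     ", time_optimizer(lambda: tf.train.AdamOptimizer(0.1)))
print("Adafactor:", time_optimizer(
    lambda: adafactor.AdafactorOptimizer(  # constructor args assumed
        learning_rate=0.1, factored=False,
        multiply_by_parameter_scale=False)))

I can run something like this on a single GPU and report numbers if that would be useful.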
Could anyone take a look?
Thanks very much!