Cautious Optimizers: Improving Training with One Line of Code
This work addresses the need for faster and more stable optimizers in deep learning training, particularly for transformer models, offering a simple and effective solution with broad applicability.
The authors aim to improve the training speed and stability of momentum-based optimizers such as AdamW through a single-line modification, which they call cautious optimizers. They report speed-ups of up to 1.47 times on Llama and MAE pretraining, along with better results on LLM post-training tasks.
AdamW has been the default optimizer for transformer pretraining. For many years, our community has searched for faster and more stable optimizers, with only limited success. In this work, we propose a single-line modification in PyTorch to any momentum-based optimizer, which we call a cautious optimizer, e.g. C-AdamW and C-Lion. Our theoretical result shows that this modification preserves Adam's Hamiltonian function and does not break the convergence guarantee under the Lyapunov analysis. In addition, our theoretical insight reveals a whole new family of optimizers. Among them, we pick the simplest one for empirical experiments, showing not only a speed-up of up to $1.47$ times on Llama and MAE pretraining, but also better results in LLM post-training tasks. Code is available at https://github.com/kyleliang919/C-Optim.
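
The abstract does not spell out the modification itself. As a rough sketch only, assuming the rule is to keep the components of the optimizer's proposed update whose sign agrees with the current gradient, zero the rest, and rescale the survivors by the inverse of the kept fraction, the idea can be illustrated in plain Python. The function name `cautious_update`, the `eps` value, and the list-based representation are illustrative assumptions, not the authors' code; the linked repository is the reference for the actual PyTorch one-liner.

```python
def cautious_update(update, grad, eps=1e-8):
    """Mask an optimizer update so it only moves along gradient-aligned components.

    `update` is the step a base optimizer (e.g. AdamW) would subtract from the
    parameters; `grad` is the current gradient. Both are flat lists of floats.
    This is a hypothetical illustration, not the authors' implementation.
    """
    # 1.0 where the proposed update and the gradient share a sign, else 0.0.
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    # Fraction of components kept; used to rescale so the step is not shrunk
    # merely because some components were masked out.
    kept_fraction = sum(mask) / len(mask)
    scale = 1.0 / (kept_fraction + eps)
    return [u * m * scale for u, m in zip(update, mask)]


# Two of the four components disagree in sign with the gradient and are zeroed;
# the remaining two are scaled up by roughly 1 / 0.5 = 2.
update = [0.5, -0.2, 0.1, 0.4]
grad = [1.0, 0.3, -0.7, 2.0]
masked = cautious_update(update, grad)

# Apply the masked step with an assumed learning rate.
lr = 0.1
params = [1.0, 1.0, 1.0, 1.0]
params = [p - lr * s for p, s in zip(params, masked)]
```

In a tensor library the same masking would typically be a single elementwise expression applied just before the parameter update, which is consistent with the "single-line" framing above.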