Why Gradients Rapidly Increase Near the End of Training
This short note addresses a specific training instability in large language models, the rapid growth of the gradient norm late in training, and offers practitioners an incremental improvement.
The paper shows that the rapid increase in gradient norm near the end of LLM training is caused by an interaction between weight decay, normalization layers, and the learning rate schedule, and proposes a simple correction that also lowers the loss throughout training.
During long-duration Large Language Model (LLM) training runs, the gradient norm increases rapidly near the end of training. In this short note, we show that this increase is due to an unintended interaction between weight decay, normalization layers, and the learning rate schedule. We propose a simple correction that fixes this behavior while also resulting in lower loss values throughout training.
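The interaction can be illustrated with a toy simulation. The sketch below is not the note's experimental setup: it assumes plain SGD with decoupled weight decay (not Adam), a cosine schedule that decays to zero, and a single weight vector feeding a normalization layer, so that the loss is scale-invariant in those weights. In that setting the gradient is orthogonal to the weights and its norm scales as 1/||w||. The function names, the hyperparameters, and the `scale_wd_with_lr` option are all hypothetical. That option scales the weight decay coefficient in proportion to the learning rate, which is one plausible correction chosen here for illustration and is not necessarily the one the note proposes.

```python
import numpy as np


def simulate(steps=20000, dim=64, lr_max=0.1, wd=0.1,
             scale_wd_with_lr=False, seed=0):
    """Toy model of a normalized layer trained with SGD + decoupled weight decay.

    Returns the gradient norm recorded at every step.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    grad_norms = np.empty(steps)
    for t in range(steps):
        # Cosine learning rate schedule decaying from lr_max to (almost) zero.
        lr = 0.5 * lr_max * (1.0 + np.cos(np.pi * t / steps))

        # Gradient of a scale-invariant loss: a random direction with its
        # radial component removed, scaled by 1 / ||w||.
        h = rng.normal(size=dim)
        w_norm = np.linalg.norm(w)
        w_hat = w / w_norm
        g = (h - (h @ w_hat) * w_hat) / w_norm
        grad_norms[t] = np.linalg.norm(g)

        # Assumed correction (illustrative only): weight decay proportional
        # to the current learning rate. Otherwise the coefficient is fixed.
        lam = wd * lr / lr_max if scale_wd_with_lr else wd

        # SGD step with decoupled weight decay.
        w = w - lr * g - lr * lam * w
    return grad_norms


def end_to_mid_ratio(grad_norms, window=100):
    """Mean gradient norm over the last steps divided by the mid-run mean."""
    mid = len(grad_norms) // 2
    mid_mean = grad_norms[mid - window // 2: mid + window // 2].mean()
    return grad_norms[-window:].mean() / mid_mean


if __name__ == "__main__":
    print("fixed weight decay: ", end_to_mid_ratio(simulate()))
    print("scaled weight decay:", end_to_mid_ratio(simulate(scale_wd_with_lr=True)))
```

With a fixed coefficient, the balance between gradient growth and weight decay gives roughly ||g|| / ||w|| ≈ sqrt(2 · wd / lr), so the weights shrink and the gradient norm climbs as the learning rate decays. When the coefficient tracks the learning rate, that ratio stays roughly constant in this toy model and the late-training rise does not appear.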