GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

Boao Kong, Weichen Jia, Engao Zhang, Guohong Li, Yonghan Dong, Yao Wang, Yaoyuan Wang, Yunke Peng, Kun Yuan

arXiv:2606.0053972.8h-index: 4

AI Analysis

For practitioners of low-precision LLM training, GNMR provides a backend-agnostic solution to improve stability without changing numerical formats or kernels, addressing a key bottleneck in efficient training.

The paper tackles runtime stability control in low-precision LLM training, introducing GNMR, a lightweight controller that uses gradient norm-to-mean ratio and delta-GNMR to detect and recover from numerical risks. It preserves high-fidelity quality with sparse recovery across activation-quantization stress, DeepSeek-style training, and LLaMA-2 13B fine-tuning.

Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $Δ$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.

View on arXiv PDF

Similar