LG AI CL | Oct 30, 2025

Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

arXiv:2510.26788v1 | 37 citations | h-index: 19

Originality: Incremental advance

AI Analysis

This paper addresses instability in RL fine-tuning of LLMs. It is an incremental improvement achieved through a simple change of floating-point precision.

The paper identifies floating point precision as the root cause of instability in RL fine-tuning of LLMs, showing that simply switching from BF16 to FP16 eliminates the training-inference mismatch. This change yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks.

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
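
The rounding-error gap described in the abstract can be illustrated with a small standard-library sketch. It is not taken from the paper. It relies only on the format definitions: BF16 stores 7 explicit mantissa bits (with 8 exponent bits), while FP16 stores 10 explicit mantissa bits (with 5 exponent bits). The helper names `round_to_bf16`, `round_to_fp16`, and `max_relative_error` are hypothetical and exist only for this illustration.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value (round-to-nearest-even).

    Illustrative helper: BF16 is the top 16 bits of a float32, so we
    round away the lower 16 bits. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to even
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def round_to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value via struct's half-precision format.

    Raises OverflowError for magnitudes beyond the FP16 range (about 65504).
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(round_fn, values):
    """Largest relative rounding error of round_fn over the given values."""
    return max(abs(round_fn(v) - v) / abs(v) for v in values)


# Probability-like values in [0.01, 1), all within the normal range of both formats.
values = [i / 1000 for i in range(10, 1000)]
bf16_err = max_relative_error(round_to_bf16, values)
fp16_err = max_relative_error(round_to_fp16, values)

print(f"max relative rounding error, BF16: {bf16_err:.2e}")
print(f"max relative rounding error, FP16: {fp16_err:.2e}")
```

For values in the normal range, the relative rounding error is bounded by roughly 2^-8 in BF16 and 2^-11 in FP16, so FP16 is about eight times more precise there. The price is a much smaller dynamic range, which is the trade-off the abstract alludes to.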
