LG AI CL · Jan 14

GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization

Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang

arXiv:2601.09233v13.82 citations · h-index: 2 · Has Code

Originality: Highly original

AI Analysis

This work addresses a key bottleneck in training large reasoning models and offers a principled approach to improving global optimality in post-training pipelines.

The paper tackles the optimization mismatch in post-training for Large Reasoning Models by proposing Gibbs Initialization with Finite Temperature (GIFT). GIFT reformulates Supervised Fine-Tuning to prevent distributional collapse and to give Reinforcement Learning a better initialization; the authors report that it significantly outperforms standard SFT and other competitive baselines.

The prevailing post-training paradigm for Large Reasoning Models (LRMs)--Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)--suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
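The abstract's contrast between a zero-temperature limit and a finite-temperature potential can be illustrated with a toy example. The sketch below is not the authors' implementation, and the abstract does not state GIFT's exact objective. It assumes a generic Gibbs tilt of the form p_T(i) ∝ base(i) · exp(1[i = target] / T) over a three-way categorical distribution. The function names `gibbs_tilt` and `entropy`, the indicator-style potential, and the numbers are all hypothetical choices made for illustration.

```python
import math


def gibbs_tilt(base_probs, target_index, temperature):
    """Tilt a base distribution toward a supervised target (toy assumption).

    p_T(i) is proportional to base(i) * exp(1[i == target] / T).
    All base probabilities must be strictly positive.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 is only a limit")
    # Work in log space and subtract the max for numerical stability.
    logits = [
        math.log(p) + (1.0 / temperature if i == target_index else 0.0)
        for i, p in enumerate(base_probs)
    ]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


base = [0.5, 0.3, 0.2]   # hypothetical base-model prior over three outputs
target = 1               # hypothetical index of the supervised output

warm = gibbs_tilt(base, target, temperature=1.0)    # finite temperature
cold = gibbs_tilt(base, target, temperature=0.01)   # near the T -> 0 limit

print("finite T:", [round(p, 3) for p in warm], "entropy", round(entropy(warm), 3))
print("near zero T:", [round(p, 3) for p in cold], "entropy", round(entropy(cold), 3))
```

In this toy setting the finite-temperature distribution moves mass toward the supervised output while keeping the base model's relative preferences among the other outputs, whereas the near-zero-temperature distribution is effectively one-hot and has almost no entropy left for later exploration.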
