LG CL · Dec 12, 2025

Rethinking Expert Trajectory Utilization in LLM Post-training

Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin

arXiv:2512.11470v1 · h-index: 7

Originality: Incremental advance

AI Analysis

The paper gives researchers and practitioners actionable guidelines for getting the most out of expert trajectories in LLM post-training. The contribution is incremental: it refines existing methods rather than introducing a new paradigm.

The paper addresses how best to use expert trajectories in LLM post-training. It proposes the Plasticity-Ceiling Framework, establishes the Sequential SFT-then-RL pipeline as superior to synchronized approaches, and derives scaling guidelines showing that data scale determines the primary potential while trajectory difficulty acts as a multiplier.

While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) we identify the Minimum SFT Validation Loss as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
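
Guideline (3) can be read as a simple selection rule over candidate expert-trajectory sets. The sketch below is an illustration only, not the authors' code. It assumes that each candidate set has already been used for an SFT run, that a validation-loss curve was logged for each run, and that a lower minimum validation loss is the preferred signal. The function names, dataset names, and loss values are all hypothetical.

```python
from typing import Dict, List, Tuple


def min_val_loss(curve: List[float]) -> Tuple[int, float]:
    """Return (evaluation step, value) of the lowest validation loss in a curve."""
    step = min(range(len(curve)), key=curve.__getitem__)
    return step, curve[step]


def select_trajectory_set(curves: Dict[str, List[float]]) -> str:
    """Pick the candidate set whose SFT run reaches the lowest validation loss.

    Assumption: lower minimum SFT validation loss is the better indicator.
    """
    return min(curves, key=lambda name: min_val_loss(curves[name])[1])


# Hypothetical validation-loss curves, one value per evaluation step (made-up numbers).
curves = {
    "small_easy": [1.20, 0.95, 0.90, 0.93, 0.99],
    "large_hard": [1.30, 0.90, 0.74, 0.71, 0.73],
}

best = select_trajectory_set(curves)
step, loss = min_val_loss(curves[best])
print(best, step, loss)
```

The step at which the minimum occurs is reported alongside the chosen set because it marks where validation loss stops improving. How that step relates to the Stable and Mild Overfitting Sub-phases in guideline (1) is defined in the paper, not in this sketch.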
