CL AI May 10

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

arXiv:2605.09253

AI Analysis

For researchers in reinforcement learning and model distillation, this work reveals inefficiencies in uniform token weighting and proposes a more efficient distillation paradigm.

The paper identifies 'Rock Tokens' in On-Policy Distillation (OPD): tokens whose loss stays persistently high even after training saturates, and which can account for up to 18% of generated tokens. These tokens contribute disproportionately to gradient norms but provide negligible functional benefit to reasoning performance, and bypassing them streamlines alignment.

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that, as the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
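As a rough illustration of the idea in the abstract, the sketch below computes a per-token KL loss, flags positions whose loss stays high across checkpoints, and averages the loss over the remaining tokens. It is not the authors' implementation: the KL direction (student relative to teacher), the fixed `threshold`, the assumption that the same token positions are tracked across checkpoints, and all function names are hypothetical choices made for this example. The abstract does not state the paper's actual criterion for a Rock Token.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vocabulary-sized row.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_token_kl(student_logits, teacher_logits):
    # KL(student || teacher) at each position.
    # The direction is an assumption; the abstract only says "per-token KL".
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(s_row)
        q = softmax(t_row)
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses

def flag_persistent_high_loss(loss_history, threshold):
    # loss_history: one list of per-token losses per checkpoint, same positions
    # in each (a simplification; on-policy samples change during training).
    # A position is flagged only if its loss exceeds the hypothetical
    # threshold at every recorded checkpoint.
    n = len(loss_history[0])
    return [all(step[i] > threshold for step in loss_history) for i in range(n)]

def masked_mean(losses, skip):
    # Mean loss over tokens that are not flagged, i.e. non-uniform weighting
    # with weight 0 on flagged tokens and weight 1 elsewhere.
    kept = [loss for loss, s in zip(losses, skip) if not s]
    return sum(kept) / len(kept) if kept else 0.0

if __name__ == "__main__":
    history = [[0.1, 2.0, 1.5], [0.05, 1.9, 0.2], [0.01, 2.1, 0.1]]
    flags = flag_persistent_high_loss(history, threshold=1.0)
    print(flags)
    print(masked_mean(history[-1], flags))
```

In this toy history only the second position stays above the threshold at every checkpoint, so it alone is excluded from the final average; the third position starts high but converges and is kept.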
