LG · CL · May 9

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Xin Li, Hao Jiang, Annan Wang, Yichi Zhang, Chau Yuen

arXiv:2605.08737 · 35.6

Predicted impact: top 9% in LG · last 90 days
Originality: Incremental advance

AI Analysis

Provides a theoretical and practical safety bound for on-policy distillation, crucial for practitioners training smaller models on structured-output tasks.

The paper identifies an 'extrapolation cliff' in on-policy distillation for LLMs: once the reward-extrapolation coefficient exceeds a threshold, structured outputs suffer format collapse. The authors derive a closed-form safety threshold and validate it on Amazon Fashion, where a 1.7B student reaches in-domain parity with an 8B-SFT baseline at one-fifth the parameters.

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.
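The separation between parse validity and NDCG@1 on parsed outputs can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The JSON contract (`{"ranking": [...]}` containing every candidate id exactly once), the linear-gain form of NDCG@1, the shared candidate set across outputs, and the function names `parse_ranking`, `ndcg_at_1`, and `evaluate` are all assumptions made for illustration.

```python
import json


def parse_ranking(raw, candidates):
    """Return the ranked id list if `raw` meets the assumed contract, else None.

    Assumed (hypothetical) contract: a JSON object with a "ranking" key whose
    value lists every candidate id exactly once, as strings.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    ranking = obj.get("ranking") if isinstance(obj, dict) else None
    if not isinstance(ranking, list):
        return None
    if not all(isinstance(item, str) for item in ranking):
        return None
    if sorted(ranking) != sorted(candidates):
        return None
    return ranking


def ndcg_at_1(ranking, relevance):
    """NDCG@1 with linear gain: relevance of the top item over the best relevance."""
    ideal = max(relevance.values())
    if ideal <= 0:
        return 0.0
    return relevance[ranking[0]] / ideal


def evaluate(outputs, candidates, relevance):
    """Return (parse validity, mean NDCG@1 over parsed outputs only)."""
    parsed = []
    for raw in outputs:
        ranking = parse_ranking(raw, candidates)
        if ranking is not None:
            parsed.append(ranking)
    validity = len(parsed) / len(outputs) if outputs else 0.0
    if not parsed:
        return validity, float("nan")
    mean_ndcg = sum(ndcg_at_1(r, relevance) for r in parsed) / len(parsed)
    return validity, mean_ndcg
```

Under this split, a run in which the format collapses lowers the first number while leaving the second largely unchanged, because unparseable outputs are excluded from the ranking metric rather than scored as zero.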
