LGAIMay 27

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

arXiv:2605.2839675.2
AI Analysis

For practitioners of on-policy distillation, ADWIN offers a principled way to trade off compute and accuracy without sacrificing performance.

ADWIN reduces the training cost of on-policy distillation by adaptively shortening rollout lengths, achieving up to 4.1× cost reduction with comparable or better accuracy on math and code reasoning tasks.

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix--full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy--compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes