LGAI · Jan 12

Stable On-Policy Distillation through Adaptive Target Reformulation

arXiv:2601.07155v116 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work targets a key bottleneck in deploying large language models efficiently: distilling them into smaller models stably and effectively. The contribution is incremental, since it builds on existing on-policy distillation methods.

The paper addresses training instability in on-policy knowledge distillation for language models, which arises from the distributional gap between the student and teacher models. It proposes Veto, an objective-level reformulation that stabilizes optimization, and reports consistent gains over baselines on reasoning and generation tasks.

Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
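
The idea of an intermediate target built in logit space can be illustrated with a small sketch. The abstract does not give Veto's actual formula, so everything below is an assumption made for illustration: the target is taken to be a softmax over a linear interpolation of student and teacher logits (a geometric mixture of the two distributions), with beta as the interpolation weight. The names `bridge_target`, `softmax`, and `kl`, and the example logits, are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def bridge_target(student_logits, teacher_logits, beta):
    """ASSUMED form of an intermediate target, not the paper's definition.

    Interpolates the two logit vectors linearly, which corresponds to a
    geometric mixture of the student and teacher distributions.
    beta = 0 returns the student distribution, beta = 1 the teacher's.
    """
    mixed = [(1.0 - beta) * s + beta * t
             for s, t in zip(student_logits, teacher_logits)]
    return softmax(mixed)


# Made-up next-token logits over a four-token vocabulary.
student_logits = [2.0, 0.5, -1.0, -3.0]
teacher_logits = [-1.0, 3.0, 0.0, -2.0]

student = softmax(student_logits)
teacher = softmax(teacher_logits)

for beta in (0.0, 0.5, 1.0):
    target = bridge_target(student_logits, teacher_logits, beta)
    # Divergence from the intermediate target to the student, i.e. how far
    # the student would have to move to match the target.
    print(f"beta={beta:.1f}  KL(target || student)={kl(target, student):.4f}")
```

Under this assumed form, a smaller beta keeps the target close to what the student already predicts, so the student is asked to close a smaller gap than when matching the teacher directly. Whether Veto uses this exact interpolation, and how beta enters its gradient, should be checked against the paper itself.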
