LG · AI · May 26

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

arXiv:2605.27765 · 82.2 · h-index: 5
AI Analysis

For researchers improving LLM reasoning via reinforcement learning, this paper provides a simple, zero-cost fix to SDPO that yields consistent improvements.

The paper identifies that SDPO, unlike GRPO, lacks difficulty awareness, and proposes SC-SDPO, which weights each question's loss by sqrt(p(1-p)), where p is the question's pass rate, to restore an implicit curriculum. Experiments show consistent gains over SDPO of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B.

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
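The weighting rule in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes binary (0/1) rollout rewards, estimates each question's pass rate $\hat{p}$ as the mean reward over its rollouts, and interprets "batch-adaptive normalization" as dividing by the batch-mean weight, which is an assumption the abstract does not spell out. The function names `pass_rate`, `sc_weights`, and `weighted_loss` are hypothetical.

```python
import math
from typing import List, Sequence


def pass_rate(rewards: Sequence[float]) -> float:
    """Empirical pass rate p_hat: mean of binary rewards over one question's rollouts."""
    return sum(rewards) / len(rewards)


def sc_weights(rollout_rewards: Sequence[Sequence[float]], eps: float = 1e-8) -> List[float]:
    """Per-question weights sqrt(p_hat * (1 - p_hat)), rescaled to have batch mean 1.

    The mean-1 rescaling is an assumed form of batch-adaptive normalization.
    """
    raw = []
    for rewards in rollout_rewards:
        p_hat = pass_rate(rewards)
        raw.append(math.sqrt(p_hat * (1.0 - p_hat)))
    mean_raw = sum(raw) / len(raw)
    if mean_raw < eps:
        # Every question is always solved or never solved: no weight to distribute.
        return [0.0] * len(raw)
    return [w / mean_raw for w in raw]


def weighted_loss(per_question_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Batch loss as the weighted mean of per-question (e.g. SDPO) losses."""
    total = sum(w * loss for w, loss in zip(weights, per_question_losses))
    return total / len(per_question_losses)


# Three questions with four rollouts each: never solved, half solved, always solved.
batch_rewards = [[0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]]
weights = sc_weights(batch_rewards)
loss = weighted_loss([2.0, 1.0, 0.5], weights)
```

In this toy batch only the intermediate-difficulty question receives a nonzero weight, which is the sweet-spot behaviour the abstract describes. Because the pass rates come from rollouts that on-policy training already generates, the weights require no extra sampling.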

