PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
PAINT is a more effective self-distillation method for LLM reasoning that outperforms strong baselines, though it is an incremental improvement over existing on-policy self-distillation.
PAINT improves LLM reasoning by adaptively masking verified solutions and interpolating token-level training signals. On Qwen3-8B, its macro Avg@12 across competition-level math benchmarks is 2.1 points higher than prior self-distillation and 2.9 points higher than GRPO.
Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the verified solution according to rollout-reference overlap and applies a small energy-space interpolation on a sparse set of entropy-mismatch token positions. Across competition-level math benchmarks, PAINT consistently improves over a strong prior on-policy self-distillation baseline at all three Qwen3 scales. On Qwen3-8B, it raises macro Avg@12 by 2.1 points over this prior baseline and 2.9 points over GRPO.
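To make the two ingredients named above concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify the overlap measure, the direction or granularity of masking, the mismatch criterion, or the interpolation weight, so every such choice here is an assumption: `overlap_ratio` uses unigram set overlap, `mask_solution` hides a larger suffix of the solution as overlap grows, positions are selected by the largest absolute entropy gap between student and privileged distributions, and "energy-space interpolation" is read as a convex combination of logits. All function names and the parameters `alpha`, `top_frac`, and `mask_token` are hypothetical.

```python
import math


def overlap_ratio(rollout_tokens, reference_tokens):
    """Fraction of reference tokens that also appear in the rollout.
    Assumption: unigram set overlap stands in for the paper's measure."""
    if not reference_tokens:
        return 0.0
    rollout_set = set(rollout_tokens)
    hits = sum(1 for tok in reference_tokens if tok in rollout_set)
    return hits / len(reference_tokens)


def mask_solution(reference_tokens, overlap, mask_token="<mask>"):
    """Reveal a prefix of the verified solution and mask the rest.
    Assumption: the revealed fraction is 1 - overlap, so a rollout that
    already agrees with the reference sees less privileged context."""
    n_reveal = int((1.0 - overlap) * len(reference_tokens))
    hidden = len(reference_tokens) - n_reveal
    return reference_tokens[:n_reveal] + [mask_token] * hidden


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def interpolated_targets(student_logits, teacher_logits, alpha=0.1, top_frac=0.25):
    """Build sparse token-level targets.
    Assumption: positions are ranked by |H(student) - H(teacher)| and only
    the top `top_frac` receive a target, formed by mixing logits with a
    small weight `alpha` on the privileged (teacher) side."""
    n = len(student_logits)
    gaps = [
        abs(entropy(softmax(s)) - entropy(softmax(t)))
        for s, t in zip(student_logits, teacher_logits)
    ]
    k = max(1, int(top_frac * n))
    ranked = sorted(range(n), key=lambda i: gaps[i], reverse=True)
    selected = set(ranked[:k])
    targets = []
    for i in range(n):
        if i in selected:
            mixed = [
                (1.0 - alpha) * s + alpha * t
                for s, t in zip(student_logits[i], teacher_logits[i])
            ]
            targets.append(softmax(mixed))
        else:
            targets.append(None)  # no distillation signal at this position
    return targets, selected
```

Under these assumptions, mixing logits corresponds to a geometric rather than arithmetic interpolation of the two distributions, and a small `alpha` keeps the target close to the student's own distribution. Positions whose target is `None` would simply contribute no distillation loss.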