LG · CL · May 11

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

arXiv:2605.10781 · 98.41 citations
Predicted impact: top 1% in LG · last 90 days
Originality: Highly original
AI Analysis

For practitioners of RL-based post-training of LLMs, this work introduces a principled new design axis (information asymmetry) that improves reasoning exploration beyond existing self-distillation and exploration methods.

The paper proposes RLRT, a method that reverses teacher signals in self-distilled RLVR to reinforce student tokens that diverge from teacher predictions on correct rollouts, achieving substantial gains over baselines across multiple Qwen3 checkpoints.

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
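
The abstract does not give the exact RLRT objective, so the following is only a minimal sketch of the idea it describes, under stated assumptions. It computes GRPO-style group-normalized advantages and then, on correct rollouts only, adds a per-token bonus where the student assigned the sampled token a higher log-probability than the teacher did. The function names, the log-probability gap as the divergence measure, and the `beta` and `clip` parameters are all hypothetical choices for illustration, not the paper's formulation.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def reversed_teacher_token_advantages(
    rewards: List[float],
    student_logps: List[List[float]],
    teacher_logps: List[List[float]],
    beta: float = 0.5,   # assumed bonus scale (hypothetical)
    clip: float = 2.0,   # assumed cap on the log-prob gap (hypothetical)
) -> List[List[float]]:
    """Per-token advantages with an assumed reversed-teacher bonus.

    student_logps[i][t] and teacher_logps[i][t] are the log-probabilities
    that the student and the information-privileged teacher assign to the
    t-th sampled token of rollout i.
    """
    seq_adv = grpo_advantages(rewards)
    out = []
    for r, adv, s_lp, t_lp in zip(rewards, seq_adv, student_logps, teacher_logps):
        token_adv = []
        for s, t in zip(s_lp, t_lp):
            bonus = 0.0
            if r > 0:  # only correct rollouts receive the bonus
                # Positive gap: the teacher found this token less likely
                # than the student did, i.e. the student diverged.
                gap = s - t
                bonus = beta * min(max(gap, 0.0), clip)
            token_adv.append(adv + bonus)
        out.append(token_adv)
    return out


if __name__ == "__main__":
    rewards = [1.0, 0.0]  # rollout 0 is correct, rollout 1 is not
    student = [[-0.1, -0.2], [-0.1, -0.2]]
    teacher = [[-0.1, -3.0], [-3.0, -3.0]]
    for row in reversed_teacher_token_advantages(rewards, student, teacher):
        print([round(a, 3) for a in row])
```

In this sketch, the incorrect rollout keeps its plain GRPO advantage on every token, while the correct rollout gets extra weight only on the token where the teacher disagreed with the student. A standard self-distillation term would instead pull the student toward the teacher on that same token.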

