Can Large Reasoning Models Self-Train?
This work addresses the problem of enabling sustained self-improvement in AI reasoning models. The contribution is incremental: it builds on existing RL methods and mainly highlights the limitations of self-training.
The paper investigates whether large reasoning models can self-train using reinforcement learning with majority voting as a self-feedback mechanism. It finds that this approach improves both reasoning performance and feedback quality, but that prolonged training leads to reward hacking and performance collapse.
Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training, the process where a model learns from its own judgments, can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. Across a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance, but also its ability to generate higher-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of this self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms to enable prolonged self-improvement.
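To make the self-feedback mechanism concrete, the following is a minimal sketch of a majority-vote pseudo-reward, not the paper's implementation. It assumes that several final answers are sampled per prompt, that the most common answer is treated as the pseudo-label, and that each sample receives a binary reward for agreeing with it. The function name `majority_vote_rewards` and the 0/1 reward values are illustrative assumptions.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Compute a majority-vote pseudo-label and per-sample pseudo-rewards.

    `answers` holds the final answers extracted from several sampled
    responses to one prompt. The binary reward scheme is an assumption
    made for illustration; no ground-truth label is used anywhere.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the answer with the highest count; ties are
    # broken in favor of the answer that was encountered first.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Samples that mostly agree: the minority answer gets zero reward.
label, rewards = majority_vote_rewards(["42", "42", "41", "42"])
print(label, rewards)

# Samples that all agree get full reward whether or not the answer is right.
label, rewards = majority_vote_rewards(["0", "0", "0", "0"])
print(label, rewards)
```

The second call hints at one way such a reward could be exploited: because it measures only agreement among samples, a model that always emits the same answer would receive the maximum pseudo-reward regardless of correctness. Whether this is the exact collapse mode observed in the paper is not stated in the abstract.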