RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
This work addresses the high computational cost of self-supervised reasoning in large language models and offers a more scalable approach for researchers and practitioners. The contribution is incremental, since it builds on existing TTRL methods.
The paper tackles the computational inefficiency of test-time reinforcement learning (TTRL) for improving reasoning in large language models. It proposes RoiRL, an offline iterative reinforcement learning method that eliminates the need for a reference model and reduces memory and compute requirements. RoiRL achieves 2.5x faster training and outperforms TTRL on reasoning benchmarks.
Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
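The two ingredients named in the abstract, majority-vote rewards and a weighted log-likelihood objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `majority_vote_rewards` and `weighted_nll`, the exponential weighting `exp(reward / beta)`, and the parameter `beta` are assumptions chosen for illustration, and the per-sample log-probabilities are assumed to be precomputed from samples drawn offline.

```python
from collections import Counter
import math


def majority_vote_rewards(answers):
    """Assign reward 1.0 to sampled answers that match the most common answer, else 0.0.

    No ground-truth label is used: the majority answer acts as a pseudo-label.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def weighted_nll(log_probs, rewards, beta=1.0):
    """Weighted negative log-likelihood over a fixed (offline) batch of samples.

    ASSUMPTION: weights are exp(reward / beta), normalized to sum to 1.
    This is one plausible weighting, not necessarily the one used in the paper.
    """
    weights = [math.exp(r / beta) for r in rewards]
    total = sum(weights)
    return -sum((w / total) * lp for w, lp in zip(weights, log_probs))


# Hypothetical final answers extracted from four sampled completions.
answers = ["42", "42", "17", "42"]
rewards = majority_vote_rewards(answers)

# Hypothetical sequence log-probabilities of those completions under the current model.
log_probs = [-1.2, -0.8, -2.5, -1.0]
loss = weighted_nll(log_probs, rewards, beta=0.5)
```

Because the loss depends only on the log-likelihood of previously generated samples, a sketch like this needs no second copy of the model to compute a KL penalty, which is the property the abstract attributes to RoiRL.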