ScRPO: From Errors to Insights
The work addresses how to improve the reliability of language models on difficult tasks when external feedback is limited, though it appears incremental because it builds on existing methods such as GRPO.
The paper aims to improve large language models on challenging mathematical problems. It proposes ScRPO, a reinforcement learning framework built on self-reflection and error correction, and reports that it consistently outperforms several post-training methods on benchmarks such as AIME and GSM8k.
We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
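The two stages described above can be illustrated with a small sketch. This is not the authors' implementation: the names `ErrorPool`, `trial_and_error_step`, and `reflection_prompt`, the exact-match binary reward, the use of the population standard deviation, and the prompt wording are all assumptions made for illustration. The only part taken from GRPO as commonly described is the group-relative advantage, which normalizes each sampled answer's reward by the mean and standard deviation of its group.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class ErrorPool:
    """Hypothetical container for (question, wrong answer) pairs."""
    items: list = field(default_factory=list)

    def add(self, question: str, wrong_answer: str) -> None:
        self.items.append((question, wrong_answer))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same question.

    Assumption: population standard deviation plus a small eps for
    stability; real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trial_and_error_step(question, sampled_answers, reference, pool):
    """Stage 1 sketch: score a group of answers and log the wrong ones.

    Assumption: reward is 1.0 for an exact match with the reference
    answer and 0.0 otherwise.
    """
    rewards = [1.0 if a == reference else 0.0 for a in sampled_answers]
    for answer, reward in zip(sampled_answers, rewards):
        if reward == 0.0:
            pool.add(question, answer)
    return group_relative_advantages(rewards)


def reflection_prompt(question, wrong_answer):
    """Stage 2 sketch: build a prompt asking the model to reflect.

    The wording is hypothetical; the abstract does not give a template.
    """
    return (
        f"Question: {question}\n"
        f"Your previous answer: {wrong_answer}\n"
        "This answer was incorrect. Explain why it was wrong, "
        "then give a corrected solution."
    )


pool = ErrorPool()
advantages = trial_and_error_step(
    "What is 7 * 8?", ["56", "54", "58", "56"], "56", pool
)
prompts = [reflection_prompt(q, a) for q, a in pool.items]
```

In this toy run the two correct samples receive positive advantages, the two incorrect ones receive negative advantages, and the incorrect pairs are stored so that the second stage can build one reflection prompt per stored error.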