Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
This addresses a bottleneck in training LLM agents for software engineering by providing more nuanced feedback during multi-step interactions, though it appears incremental as it builds on existing reinforced fine-tuning methods.
The paper tackles the problem of limited guidance from binary terminal rewards in fine-tuning LLM agents for software engineering tasks by introducing a rubric-based Generative Reward Model that provides richer learning signals. The approach outperforms terminal-score-only rejection sampling by suppressing undesirable behavioral patterns and promoting beneficial ones, ultimately improving final test accuracy.
Despite recent progress in Large Language Model (LLM) Agents for Software Engineering (SWE) tasks, end-to-end fine-tuning typically relies on verifiable terminal rewards such as whether all unit tests pass. While these binary signals reflect whether the final solution is correct, they provide little guidance for shaping intermediate behaviors during multi-step interactions, thereby limiting improvements in the overall quality of the resolution process. To address this, we introduce a rubric-based Generative Reward Model (GRM) that provides richer learning signals. The GRM is equipped with human-designed rubrics that indicate criteria for encouraging or discouraging specific behavioral patterns, and we leverage this feedback for high-quality training data collection via trajectory filtration. When used for Reinforced Fine-Tuning (RFT) on SWE Tasks, our approach outperforms terminal-score-only rejection sampling: it more effectively suppresses undesirable patterns while promoting beneficial ones, as confirmed by case analyses, and it ultimately improves final test accuracy.