AI · CL · Jan 29

Exploring Reasoning Reward Model for Agents

arXiv:2601.22154v1 · 6 citations · h-index: 7
Originality: Incremental advance
AI Analysis

The work addresses suboptimal training in agentic systems by offering more nuanced feedback, an incremental advance in reward modeling for AI agents.

The paper tackles the problem of sparse outcome-based rewards in Agentic Reinforcement Learning by introducing Agent-RRM, a multi-faceted reward model that provides structured feedback on reasoning quality. The reported results include 43.7% on GAIA and 46.2% on WebWalkerQA.

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
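
As a rough illustration of the idea in the abstract, the sketch below shows how a three-part feedback record could be used in two ways: as a scalar added to a sparse outcome reward, and as critique text fed back to the agent for a retry. It is not the paper's implementation. The names `ReasoningFeedback`, `blended_reward`, `refinement_prompt`, the `weight` parameter, the linear mix, and the assumed score range of [0, 1] are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningFeedback:
    """Hypothetical container for the three signals named in the abstract."""
    reasoning_trace: str  # (1) explicit reasoning trace
    critique: str         # (2) focused critique highlighting reasoning flaws
    score: float          # (3) overall process score, ASSUMED to lie in [0, 1]


def blended_reward(outcome_reward: float,
                   feedback: ReasoningFeedback,
                   weight: float = 0.5) -> float:
    """Add a weighted process score to a sparse outcome reward.

    The linear mix and the default weight are assumptions for illustration,
    not a formula taken from the paper.
    """
    if not 0.0 <= feedback.score <= 1.0:
        raise ValueError("score is assumed to lie in [0, 1]")
    return outcome_reward + weight * feedback.score


def refinement_prompt(question: str,
                      first_attempt: str,
                      feedback: ReasoningFeedback) -> str:
    """Build a retry prompt that shows the critique to the agent as plain text."""
    return (
        f"Question: {question}\n"
        f"Previous attempt: {first_attempt}\n"
        f"Critique: {feedback.critique}\n"
        "Revise the attempt, addressing the critique."
    )


# Two trajectories that both fail the task (outcome reward 0.0) but differ
# in reasoning quality. Outcome-only feedback cannot tell them apart.
careful = ReasoningFeedback("searched, cross-checked two sources",
                            "final arithmetic slip", 0.8)
sloppy = ReasoningFeedback("guessed without using tools",
                           "no evidence gathered", 0.2)

reward_careful = blended_reward(0.0, careful)
reward_sloppy = blended_reward(0.0, sloppy)
retry = refinement_prompt("What year was the bridge built?", "1890", careful)
```

With an outcome-only reward both trajectories would receive 0.0. Under the assumed blend, the more careful trajectory receives the larger reward, which is the kind of differentiation between intermediate reasoning quality that the abstract says sparse rewards lack.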

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
