AI | Mar 15

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

arXiv:2603.14465 | 91.86 citations | h-index: 15 | Has Code
AI Analysis

This work addresses the need for better evaluation of tool-using agents in dynamic, open-ended environments. It is an incremental step toward improving agent reliability and supporting research on reward models.

The paper tackles the evaluation of step-level process quality in tool-using agents, which is critical because failures in long-horizon interactions can be irreversible. It introduces AgentProcessBench, a benchmark of 1,000 trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. The experiments show that weaker models exhibit inflated correct-step ratios and that process signals complement outcome supervision.

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
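Finding (1) in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the benchmark's code: the integer encoding of the ternary labels, the helper `correct_step_ratio`, and both trajectories are hypothetical, chosen only to show how a trajectory that stops early can score a higher ratio of correct steps than one that runs to completion.

```python
from typing import List

# Hypothetical integer encoding of the ternary labels (an assumption for
# illustration; the benchmark's actual encoding may differ).
CORRECT, NEUTRAL, ERROR = 1, 0, -1


def correct_step_ratio(labels: List[int]) -> float:
    """Return the fraction of steps labeled CORRECT in one trajectory."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == CORRECT) / len(labels)


# Made-up trajectories, for illustration only.
# A weaker policy takes two valid steps and then stops, leaving the task unfinished.
early_stop = [CORRECT, CORRECT]
# A stronger policy takes ten steps, including one exploratory and one wrong step.
full_run = [CORRECT] * 6 + [NEUTRAL, ERROR, CORRECT, CORRECT]

print(correct_step_ratio(early_stop))  # 1.0
print(correct_step_ratio(full_run))    # 0.8
```

In this toy case the early-terminating trajectory reaches a ratio of 1.0 despite never attempting the harder part of the task, which is one way the ratio alone could overstate a weaker model's process quality.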

