Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents
For researchers and developers of LLM agents, the Agent Planning Benchmark (APB) is a diagnostic tool for identifying and improving planning-specific capabilities, addressing a gap left by existing end-to-end evaluations.
The paper introduces the Agent Planning Benchmark (APB), a diagnostic benchmark with 4,209 multimodal cases across 22 domains to isolate planning failures from execution failures in LLM agents. Testing 12 MLLMs reveals systematic weaknesses in long-horizon planning, tool-noise robustness, and calibrated refusal, and APB-guided refinement improves plan correctness and downstream execution metrics on ToolSandbox and τ²-bench tasks.
Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce the Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 τ²-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.
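To make the planning-versus-execution distinction concrete, the following is a minimal illustrative sketch, not APB's actual schema or scoring code. The `Setting`, `Outcome`, `attribute_failure`, and `failures_by_setting` names are hypothetical, and the boolean `plan_correct` field is an assumed simplification of whatever plan-level judgment the benchmark applies. It only shows how the five settings named above could be tagged on each case and how a failure could be attributed to the plan before the execution is considered.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    # The five settings named in the abstract; identifier names are assumptions.
    HOLISTIC = "holistic"
    STEPWISE_FEEDBACK = "stepwise_feedback"
    EXTRANEOUS_TOOLS = "extraneous_tools"
    BROKEN_TOOLS = "broken_tools"
    UNSOLVABLE = "unsolvable"


@dataclass(frozen=True)
class Outcome:
    # Hypothetical per-case record; APB's real fields are not given in the text.
    setting: Setting
    plan_correct: bool       # plan judged acceptable (including a correct refusal)
    execution_success: bool  # downstream run reached the goal


def attribute_failure(outcome: Outcome) -> str:
    """Label where a case went wrong, checking the plan before the execution."""
    if not outcome.plan_correct:
        return "planning"
    if not outcome.execution_success:
        return "execution"
    return "none"


def failures_by_setting(outcomes):
    """Count (setting, failure source) pairs, skipping cases with no failure."""
    counts = Counter()
    for outcome in outcomes:
        label = attribute_failure(outcome)
        if label != "none":
            counts[(outcome.setting.value, label)] += 1
    return counts
```

Under this assumed attribution rule, an end-to-end failure is counted as an execution failure only when the plan itself was acceptable, which is the kind of separation an end-to-end success rate alone cannot provide.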