AI · Jun 4

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

arXiv:2606.05806 · 91.4 · Has Code
Predicted impact: top 12% in AI · last 90 days · Originality: Incremental advance
AI Analysis

For researchers developing LLM-based agents, this work highlights dynamic replanning as a critical bottleneck not addressed by scaling or prompting, providing a benchmark to measure and improve robustness to real-world tool failures.

The paper introduces ToolMaze, a benchmark for evaluating how well LLM agents handle tool failures through dynamic replanning. Results show that perturbations degrade performance across nearly all models, that implicit semantic failures cause a drop of around 37% in Perturbation Recovery Rate (PRR), and that agentic fault-tolerance improves with model scale 3.66× slower than basic task execution.

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
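
The 2 × 2 perturbation taxonomy in the abstract can be made concrete with a small sketch. This is not ToolMaze's implementation, and the abstract does not say how failures are injected; the names `Visibility`, `Duration`, `PerturbedTool`, `transient_failures`, and the toy `double` tool are all hypothetical. The sketch assumes an explicit failure surfaces as a raised error, an implicit failure returns a plausible but wrong value, and a transient failure clears after a fixed number of calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # assumed: the tool signals failure by raising
    IMPLICIT = "implicit"  # assumed: the tool silently returns a wrong value


class Duration(Enum):
    TRANSIENT = "transient"  # assumed: fails for a few calls, then recovers
    PERMANENT = "permanent"  # assumed: never recovers


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one of the four failure types."""
    tool: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    transient_failures: int = 1  # illustrative: failing calls before recovery
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        failing = (
            self.duration is Duration.PERMANENT
            or self.calls <= self.transient_failures
        )
        if not failing:
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            raise RuntimeError("tool unavailable")
        # Implicit failure: a plausible-looking but corrupted result.
        return self.tool(x) + 1


def double(x: int) -> int:
    """Toy tool used only for illustration."""
    return 2 * x


# Explicit + transient: the first call raises, a retry succeeds.
flaky = PerturbedTool(double, Visibility.EXPLICIT, Duration.TRANSIENT)
try:
    flaky(21)
    first_call_raised = False
except RuntimeError:
    first_call_raised = True
retry_result = flaky(21)

# Implicit + permanent: every call returns a corrupted value without an error.
corrupt = PerturbedTool(double, Visibility.IMPLICIT, Duration.PERMANENT)
corrupt_results = [corrupt(21), corrupt(21)]
```

Under these assumptions, retrying is enough for the explicit/transient case, whereas the implicit/permanent case gives the caller no error to react to. That is consistent with the abstract's observation that over-trust in corrupted outputs drives the sharpest drops.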

