CL · Aug 14, 2025

Hell or High Water: Evaluating Agentic Recovery from External Failures

arXiv:2508.11027v1 · 3 citations · h-index: 7 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses a critical issue for deploying AI agents in real-world, dynamic environments. Its contribution is incremental, however, because it offers benchmarking and analysis rather than a new solution.

The paper tackles the problem of language model agents failing to adapt to external disruptions in planning tasks. It finds that state-of-the-art models struggle to formulate backup plans and adapt to environmental feedback, even though every task remains solvable and performance should ideally be unaffected by the injected failures.

As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent's performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work.
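
The setup described in the abstract (function calls, error messages as feedback, and a function that suddenly becomes unavailable while the task stays solvable) can be illustrated with a toy sketch. This is not the paper's benchmark or code; `ToyEnvironment`, `FunctionUnavailable`, `run_with_fallback`, and the two `convert_*` functions are hypothetical names invented for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


class FunctionUnavailable(Exception):
    """Hypothetical error signalling an external failure: the function was disabled."""


class ToyEnvironment:
    """Toy stand-in for a function-calling environment (an assumption, not the benchmark)."""

    def __init__(self, functions: Dict[str, Callable[[Any], Any]], disabled: Iterable[str]):
        self.functions = dict(functions)
        self.disabled = set(disabled)

    def call(self, name: str, arg: Any) -> Any:
        # Feedback is either the function output or an error message.
        if name in self.disabled:
            raise FunctionUnavailable(f"{name} is currently unavailable")
        return self.functions[name](arg)


def run_with_fallback(
    env: ToyEnvironment, candidates: List[str], arg: Any
) -> Tuple[str, Any, List[str]]:
    """Try candidate functions in order and fall back when one is unavailable."""
    errors: List[str] = []
    for name in candidates:
        try:
            return name, env.call(name, arg), errors
        except FunctionUnavailable as exc:
            # Record the environment feedback and move on to the backup plan.
            errors.append(str(exc))
    raise RuntimeError("no candidate function succeeded: " + "; ".join(errors))


# Two interchangeable functions, so the task stays solvable after one is disabled.
env = ToyEnvironment(
    {"convert_v1": lambda x: x * 2, "convert_v2": lambda x: x + x},
    disabled={"convert_v1"},
)
used, result, errors = run_with_fallback(env, ["convert_v1", "convert_v2"], 21)
```

Here the preferred function fails, the error message is recorded, and the alternative is used to reach the same result. The abstract reports that current agents often fail at exactly this step of acting on the error and pursuing an alternate course of action.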

