AI · CL · Mar 15

Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

arXiv:2603.14248 · 34.93 citations · h-index: 4
Predicted impact: top 11% in AI · last 90 days · Originality: Incremental advance
AI Analysis

This work targets the reliability gap of LLM web agents on navigation tasks. It offers researchers and developers a diagnostic framework rather than a solution, which makes it an incremental advance.

The paper analyzes why LLM-based web agents fail on long-horizon tasks using a hierarchical planning framework. It finds that low-level execution is the main bottleneck and that structured PDDL plans yield more concise, goal-directed strategies than natural language plans.

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.
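The three-layer view described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_episode`, `Trace`, the layer labels, and the stub planner, executor, and replanner are all hypothetical, chosen only to show how a process-based evaluation might attribute each failure to a layer instead of recording end-to-end success alone.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical layer labels for failure attribution (not taken from the paper).
PLANNING = "high_level_planning"
EXECUTION = "low_level_execution"
REPLANNING = "replanning"


def _zero_counts() -> Dict[str, int]:
    return {PLANNING: 0, EXECUTION: 0, REPLANNING: 0}


@dataclass
class Trace:
    """Per-layer attempt and failure counts for one episode."""
    attempts: Dict[str, int] = field(default_factory=_zero_counts)
    failures: Dict[str, int] = field(default_factory=_zero_counts)
    success: bool = False


def run_episode(
    goal: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], bool],
    replan: Callable[[str, str], List[str]],
    max_replans: int = 2,
) -> Trace:
    """Run plan -> execute -> replan and record which layer failed."""
    trace = Trace()

    # Layer 1: high-level planning produces a list of subgoals.
    trace.attempts[PLANNING] += 1
    steps = plan(goal)
    if not steps:
        trace.failures[PLANNING] += 1
        return trace

    replans = 0
    i = 0
    while i < len(steps):
        # Layer 2: low-level execution grounds one subgoal in the page.
        trace.attempts[EXECUTION] += 1
        if execute(steps[i]):
            i += 1
            continue
        trace.failures[EXECUTION] += 1

        # Layer 3: replanning tries to recover from the failed subgoal.
        if replans >= max_replans:
            return trace
        replans += 1
        trace.attempts[REPLANNING] += 1
        new_steps = replan(goal, steps[i])
        if not new_steps:
            trace.failures[REPLANNING] += 1
            return trace
        steps, i = new_steps, 0  # continue with the recovery plan

    trace.success = True
    return trace


# Stub components, purely illustrative: the executor cannot ground
# "click_result", and the replanner proposes an alternative subgoal.
def stub_plan(goal: str) -> List[str]:
    return ["open_search", "type_query", "click_result"]


def stub_execute(step: str) -> bool:
    return step != "click_result"


def stub_replan(goal: str, failed_step: str) -> List[str]:
    return ["click_result_by_text"]


trace = run_episode("find a product page", stub_plan, stub_execute, stub_replan)
print(trace.success, trace.failures)
```

In this toy run the episode succeeds end to end, yet the trace still records one low-level execution failure that replanning recovered from, which is the kind of signal a success-only metric would hide.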

