AI · May 18

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

arXiv:2605.18048 · 92.2
Predicted impact top 32% in AI · last 90 days
Originality Incremental advance
AI Analysis

For GUI agent developers, this work identifies critical bottlenecks in document-guided interaction, highlighting a pathway for self-evolving agents in dynamic environments.

DocOS introduces a proactive document-guided action paradigm for GUI agents, enabling them to autonomously search for and follow online documentation to handle long-tailed tasks, and provides a benchmark to evaluate this capability. Experiments show that agents face dual bottlenecks in locating relevant information and grounding instructions into actions.

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.
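
The search-then-ground loop described in the abstract can be illustrated with a toy sketch. This is not the DocOS benchmark or the authors' agent: the in-memory `DOCS` corpus, the word-overlap `search_docs`, the regex-based `ground`, and the `Action` type are all hypothetical stand-ins, chosen only to show where the two reported bottlenecks (locating the right document, and grounding its steps into actions) would sit in such a pipeline.

```python
import re
from dataclasses import dataclass

@dataclass
class Action:
    """A hypothetical GUI action; real agents would use richer action spaces."""
    kind: str       # "click" or "type"
    target: str     # name of the UI element
    text: str = ""  # text to enter, for "type" actions

# Toy stand-in for online documentation (assumption: a real agent browses the web).
DOCS = {
    "enable dark mode": "1. Click Settings\n2. Click Appearance\n3. Click Dark",
    "rename a file": "1. Click File\n2. Click Rename\n3. Type report.txt into Name",
}

def search_docs(task, docs=DOCS):
    """Return the document whose title shares the most words with the task, or None."""
    task_words = set(task.lower().split())
    best_title, best_score = None, 0
    for title in docs:
        score = len(task_words & set(title.split()))
        if score > best_score:
            best_title, best_score = title, score
    return docs[best_title] if best_title else None

# Matches numbered steps such as "2. Click Appearance".
STEP = re.compile(r"^\d+\.\s*(Click|Type)\s+(.+)$", re.IGNORECASE)

def ground(doc):
    """Translate numbered procedural steps into a list of Action objects."""
    actions = []
    for line in doc.splitlines():
        match = STEP.match(line.strip())
        if not match:
            continue  # steps that cannot be parsed are skipped (a grounding failure)
        verb, rest = match.group(1).lower(), match.group(2)
        if verb == "click":
            actions.append(Action("click", rest))
        else:
            text, _, target = rest.partition(" into ")
            actions.append(Action("type", target, text))
    return actions

def solve(task):
    """Search for documentation, then ground it; return [] if nothing is found."""
    doc = search_docs(task)
    if doc is None:
        return []  # the search stage found nothing relevant
    return ground(doc)
```

Calling `solve("How do I enable dark mode?")` yields three click actions, while a task with no matching document yields an empty plan. Keeping the two stages as separate functions makes it possible to tell whether a failure came from retrieval or from grounding.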

