AI · Nov 10, 2024

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su

Microsoft

arXiv:2411.06559v236.1114 citations · h-index: 30 · Has Code · Trans. Mach. Learn. Res.

Originality: Highly original

AI Analysis

This work addresses the challenge of efficient and effective planning for web agents in complex, irreversible environments, offering a novel approach that could enhance automation in web-based tasks.

The paper tackles irreversible actions in real-world web environments, which undermine backtracking-based planning for web agents. It proposes WebDreamer, a model-based planning framework that uses LLMs as both world models and value functions. WebDreamer achieves substantial performance improvements over reactive baselines and is competitive with tree search while being 4-5 times more efficient.

Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Relying too heavily on test-time search also hurts efficiency. We advocate model-based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by (1) proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; and (2) training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive with tree search in sandbox environments (VisualWebArena) while being 4-5 times more efficient, and it also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments.
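The planning loop described in the abstract (simulate the outcome of each candidate action with a world model, score it with a value function, then commit to one) can be illustrated with a minimal sketch. This is not the WebDreamer implementation: the function `select_action`, the `WorldModel` and `ValueFn` type aliases, and the toy stand-ins are all hypothetical names chosen for illustration, and in the paper's setting both roles would be played by LLM calls rather than string heuristics.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper's code):
# a world model maps (observation, action) to a predicted next state, and
# a value function maps (task, predicted state) to a scalar score.
WorldModel = Callable[[str, str], str]
ValueFn = Callable[[str, str], float]


def select_action(
    task: str,
    observation: str,
    candidates: Sequence[str],
    world_model: WorldModel,
    value_fn: ValueFn,
) -> Tuple[str, float]:
    """Pick the candidate whose simulated outcome scores highest.

    No action is executed in the real environment here, so an
    irreversible action is never taken just to evaluate it.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    scored = []
    for action in candidates:
        predicted = world_model(observation, action)  # simulate only
        scored.append((action, value_fn(task, predicted)))
    # max() returns the first best-scoring pair if there is a tie.
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins so the sketch runs without any LLM (illustrative only).
def toy_world_model(observation: str, action: str) -> str:
    return f"{observation} -> after {action}"


def toy_value_fn(task: str, predicted: str) -> float:
    return 1.0 if "search" in predicted else 0.0
```

Only the action returned by `select_action` would then be executed on the live website, which is what lets this style of planning avoid the backtracking that tree search depends on.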
