CL | AI | Oct 25, 2024

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

Tencent
arXiv:2410.19609v1 | 36 citations | h-index: 18 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building generalizable multimodal agents for real-world web navigation. It is an incremental advance over existing open-source agents, which are text-only and operate in synthetic environments.

The paper introduces an open-source framework for developing autonomous multimodal web agents that navigate real-world websites. The framework combines iterative real-world exploration, feedback, and optimization, and the agent's performance improves across multiple test sets after each iteration.

The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they build text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agents that can autonomously conduct real-world exploration and improve themselves. We first train the base model with imitation learning so that it gains basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.
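
The exploration-feedback-optimization cycle described in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `Trajectory`, `improvement_cycle`, `make_toy_policy`, and the toy judge and trainer are all hypothetical names, and the "policy" is a deterministic stand-in rather than a multimodal model. In the toy, each task's difficulty is assumed to be the length of its name, and one training round simply raises the policy's skill level by one.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    """One attempt at a task: the task text and the actions taken (placeholders)."""
    task: str
    steps: Tuple[str, ...]


Policy = Callable[[str], Trajectory]                      # task -> trajectory
Judge = Callable[[Trajectory], bool]                      # trajectory -> accepted?
Trainer = Callable[[Policy, List[Trajectory]], Policy]    # (policy, data) -> new policy


def improvement_cycle(
    policy: Policy,
    tasks: Sequence[str],
    judge: Judge,
    train: Trainer,
    iterations: int = 3,
) -> Tuple[Policy, List[int]]:
    """Run explore -> judge -> optimize for a fixed number of iterations.

    Returns the final policy and the number of accepted trajectories per iteration.
    """
    accepted_counts: List[int] = []
    for _ in range(iterations):
        # Exploration: the current policy attempts every task.
        trajectories = [policy(task) for task in tasks]
        # Feedback: a separate judge keeps only well-performing trajectories.
        accepted = [t for t in trajectories if judge(t)]
        accepted_counts.append(len(accepted))
        # Optimization: update the policy on the accepted trajectories only.
        if accepted:
            policy = train(policy, accepted)
    return policy, accepted_counts


def make_toy_policy(level: int) -> Policy:
    """Hypothetical stand-in policy: solves tasks whose name length is <= level."""
    def policy(task: str) -> Trajectory:
        outcome = "done" if len(task) <= level else "gave_up"
        return Trajectory(task, ("open_page", outcome))
    policy.level = level  # stored on the function so the toy trainer can read it
    return policy


def toy_judge(trajectory: Trajectory) -> bool:
    """Toy judge: accept a trajectory only if it ended with 'done'."""
    return trajectory.steps[-1] == "done"


def toy_train(policy: Policy, accepted: List[Trajectory]) -> Policy:
    """Toy trainer: any accepted data raises the skill level by one."""
    return make_toy_policy(policy.level + 1)


final_policy, history = improvement_cycle(
    make_toy_policy(1), ["a", "bb", "ccc", "dddd"], toy_judge, toy_train
)
print(history, final_policy.level)
```

The design point the sketch tries to capture is that the judge is separate from the policy and that only accepted trajectories reach the trainer, so no ground-truth reward signal is needed. In a real system the imitation-learned base model would take the place of `make_toy_policy(1)`.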

Code Implementations: 1 repo