IR, AI, DB
Jul 1, 2025

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

arXiv:2507.00938v2
h-index: 7
Originality: Incremental advance
AI Analysis

This work provides a reproducible benchmark for evaluating web agents, addressing the instability of existing benchmarks. It is incremental, however, because it builds on prior evaluation frameworks.

The authors address the challenge of evaluating multimodal web agents by introducing WebArXiv, a static benchmark of 275 tasks built on arXiv snapshots. They also identify a failure mode called Rigid History Reflection and propose a dynamic reflection mechanism that improved agent performance.

Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv. Results demonstrate clear performance differences across agents and validate the effectiveness of our proposed reflection strategy.
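
To make the contrast between a fixed interaction history and selective retrieval concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say how relevance is computed, so the token-overlap (Jaccard) score, the `Step` record, and the function names `rigid_history` and `dynamic_reflection` are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    """One past interaction step (hypothetical record, not from the paper)."""
    index: int
    action: str
    observation: str


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; a real agent would likely use embeddings.
    return set(text.lower().split())


def relevance(step: Step, query: str) -> float:
    """Jaccard overlap between a step's text and the current query (assumed metric)."""
    a = _tokens(step.action + " " + step.observation)
    b = _tokens(query)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rigid_history(history: list[Step], window: int) -> list[Step]:
    """Baseline: always pass the last `window` steps, relevant or not."""
    return history[-window:]


def dynamic_reflection(history: list[Step], query: str, k: int) -> list[Step]:
    """Select up to k past steps that overlap with the query, in original order."""
    scored = [(relevance(s, query), s) for s in history]
    relevant = [(score, s) for score, s in scored if score > 0]
    # Highest score first; ties broken by earlier step index.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].index))
    top = [s for _, s in relevant[:k]]
    return sorted(top, key=lambda s: s.index)


if __name__ == "__main__":
    history = [
        Step(0, "click advanced search", "advanced search form shown"),
        Step(1, "type author name", "author field filled"),
        Step(2, "scroll down", "footer links visible"),
        Step(3, "click submit", "results list shown"),
    ]
    query = "advanced search form date field"
    print([s.index for s in rigid_history(history, 2)])
    print([s.index for s in dynamic_reflection(history, query, 2)])
```

In this toy trace the fixed window keeps the two most recent steps even though neither mentions the search form, while the selective variant returns the earlier steps that share vocabulary with the current query.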
