CL | AI | Jan 13, 2025

WebWalker: Benchmarking LLMs in Web Traversal

arXiv:2501.07572v3 | 140 citations | h-index: 29 | ACL
AI Analysis

This work addresses the challenge of shallow content retrieval for LLMs in web-based tasks, though it appears incremental as it builds on existing RAG methods.

The paper addresses the limited performance of LLMs on complex web traversal. It introduces WebWalkerQA, a benchmark for evaluating web-traversal ability, and WebWalker, a multi-agent framework that strengthens retrieval-augmented generation and proves effective in real-world scenarios.

Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages and systematically extract high-quality data. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker, through horizontal and vertical integration in real-world scenarios.
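
The explore-critic idea mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the in-memory SITE, the Critic class, the explore function, and the keyword-overlap scoring are all hypothetical stand-ins for what would be LLM calls and real page fetches in a system like WebWalker. It only shows the general shape, in which an explorer follows the most promising link while a critic keeps relevant observations and decides when to stop.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "website": url -> (page text, {link label: target url}).
SITE = {
    "/": ("Welcome to the conference site.",
          {"Venue": "/venue", "Submission info": "/calls"}),
    "/calls": ("Calls overview.", {"Paper deadline": "/calls/papers"}),
    "/calls/papers": ("Paper submission deadline: 15 February.", {}),
    "/venue": ("The venue is downtown.", {}),
}


def words(text):
    """Lowercased word set with simple punctuation stripped."""
    return {w.strip(".,:?").lower() for w in text.split()}


@dataclass
class Critic:
    """Keeps relevant observations and decides whether the query is answered."""
    keywords: set
    memory: list = field(default_factory=list)

    def observe(self, text):
        # Assumption: relevance is plain keyword overlap (an LLM would judge this).
        if self.keywords & words(text):
            self.memory.append(text)

    def answer(self):
        # Toy stopping rule: some remembered page mentions every keyword.
        for text in self.memory:
            if self.keywords <= words(text):
                return text
        return None


def explore(keywords, start="/", max_steps=10):
    """Explorer: best-first traversal, scoring links by label/keyword overlap."""
    critic = Critic(keywords)
    frontier = [(0, start)]
    visited = []
    while frontier and len(visited) < max_steps:
        frontier.sort(key=lambda item: item[0])  # highest score ends up last
        _, url = frontier.pop()
        if url in visited:
            continue
        visited.append(url)
        text, links = SITE[url]
        critic.observe(text)
        found = critic.answer()
        if found is not None:
            return found, visited
        for label, target in links.items():
            if target not in visited:
                frontier.append((len(words(label) & keywords), target))
    return None, visited


answer, path = explore({"paper", "submission", "deadline"})
print(answer)
print(path)
```

In this sketch the answer sits two clicks below the root page, which is the kind of content a single search-engine hit on the landing page would miss.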

Code Implementations: 2 repos