IR, AI | May 22, 2025

DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes

arXiv:2505.17162v1 | 3 citations | h-index: 3 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better benchmarks to assess LLMs and retrieval-augmented generation (RAG) systems on real-world, fast-changing information. The contribution is incremental, since it builds on existing RAG and benchmarking approaches.

The authors address the problem of evaluating large language models (LLMs) on time-sensitive factual data by introducing DailyQA, a dynamic benchmark updated weekly from Wikipedia revision logs. They find that reranking web retrieval results is critical and that LLMs still face significant challenges in handling frequently updated information.

We propose DailyQA, an automatically updated dynamic dataset that refreshes its questions weekly and provides the answer to each question as of any given date. DailyQA uses daily updates from Wikipedia revision logs to implement a fully automated pipeline of data filtering, query generation, quality checking, answer extraction, and query classification. The benchmark requires large language models (LLMs) to process and answer questions that involve fast-changing factual data and cover multiple domains. We evaluate several open-source and closed-source LLMs using different RAG pipelines with web search augmentation. We compare the ability of different models to process time-sensitive web information and find that reranking web retrieval results is critical. Our results indicate that LLMs still face significant challenges in handling frequently updated information, suggesting that the DailyQA benchmark provides valuable insight into the direction of progress for LLMs and RAG systems.
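The abstract says the dataset holds the answer to each question on any given date. One plausible way to represent that, sketched here purely for illustration, is a sorted list of dated answer revisions queried with a binary search. The `REVISIONS` data, the `answer_on` helper, and the storage layout are all assumptions and are not taken from the DailyQA code.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical revision history for a single question: (effective date, answer).
# The values are placeholders for illustration, not data from DailyQA.
REVISIONS = [
    (date(2025, 1, 1), "answer A"),
    (date(2025, 3, 10), "answer B"),
    (date(2025, 5, 2), "answer C"),
]


def answer_on(revisions, query_date):
    """Return the answer in effect on query_date, or None if none exists yet.

    Assumes `revisions` is sorted by effective date in ascending order.
    """
    dates = [effective for effective, _ in revisions]
    # bisect_right counts the revisions that took effect on or before query_date.
    idx = bisect_right(dates, query_date)
    if idx == 0:
        return None
    return revisions[idx - 1][1]
```

Calling `answer_on(REVISIONS, date(2025, 4, 1))` picks the most recent revision that took effect on or before April 1, 2025. A model's response can then be scored against the answer that was correct on the day the question was asked.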
