AI · Jun 11

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Jingxuan Han, Wei Liu, Mingyang Zhu, Youpeng Wang, Ziwen Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Zheren Fu, Licheng Zhang, Zhendong Mao

arXiv:2606.12871v1 · 21.7 · Has Code

Predicted impact: top 15% in AI · last 90 days · Originality: Incremental advance

AI Analysis

Provides a more realistic and interpretable evaluation for search agents on everyday information-seeking tasks, addressing limitations of prior specialized benchmarks.

DailyReport introduces a benchmark of 150 open-ended daily search tasks with 3,546 rubrics to evaluate search agents, finding that current systems fall short of user expectations.

Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. Prior benchmarks for evaluating SAs mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge these gaps, we introduce DailyReport, an open-ended benchmark for evaluating SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, we make our dataset and code publicly available at https://github.com/AGI-Eval-Official/DailyReport.
