AI · Feb 18

SourceBench: Can AI Answers Reference Quality Web Sources?

Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, Yiying Zhang

arXiv:2602.16942v1

Originality: Incremental advance

AI Analysis

This work addresses the need to assess evidence quality in AI-generated answers. The advance is incremental: it builds on existing evaluation methods by shifting the focus from answer correctness to the quality of the cited sources.

The paper evaluates the quality of web sources cited in AI answers. It introduces SourceBench, a benchmark of 100 real-world queries scored with an eight-metric framework, and reports four key insights from evaluating eight LLMs, Google Search, and three AI search tools over 3996 cited sources.

Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
