Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses
This work addresses reliability concerns for users of AI-driven search tools, though its contribution is incremental: it proposes evaluation metrics rather than solving the underlying problems.
The study compares AI-based answer engines with traditional search engines, finds issues such as frequent hallucination and inaccurate citation, and proposes design recommendations linked to metrics that are then evaluated automatically on three popular engines.
Large Language Model (LLM)-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the appearance of Answer Engines: LLM-based generative search engines supplanting traditional search engines. Answer engines not only retrieve relevant sources to a user query but synthesize answer summaries that cite the sources. To understand these systems' limitations, we first conducted a study with 21 participants, evaluating interactions with answer vs. traditional search engines and identifying 16 answer engine limitations. From these insights, we propose 16 answer engine design recommendations, linked to 8 metrics. An automated evaluation implementing our metrics on three popular engines (You.com, Perplexity.ai, BingChat) quantifies common limitations (e.g., frequent hallucination, inaccurate citation) and unique features (e.g., variation in answer confidence), with results mirroring user study insights. We release our Answer Engine Evaluation benchmark (AEE) to facilitate transparent evaluation of LLM-based applications.
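The abstract names inaccurate citation and hallucination as measurable limitations but does not define the eight metrics. As a rough illustration only, the sketch below shows how two citation-style scores could be computed once each answer statement has been annotated with the sources it cites and the sources judged to support it. The `Statement` structure, the function names `citation_accuracy` and `unsupported_rate`, and the formulas themselves are assumptions made for this example, not the AEE benchmark's actual definitions.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Statement:
    """One sentence of an engine's answer (hypothetical annotation format)."""
    text: str
    # Source ids the answer attaches to this statement.
    cited: Set[int] = field(default_factory=set)
    # Source ids that a judge (human or model) found to support the statement.
    supported_by: Set[int] = field(default_factory=set)


def citation_accuracy(statements: List[Statement]) -> float:
    """Share of (statement, cited source) pairs where the source supports the statement.

    Assumed definition for illustration; returns 0.0 when nothing is cited.
    """
    total = sum(len(s.cited) for s in statements)
    correct = sum(len(s.cited & s.supported_by) for s in statements)
    return correct / total if total else 0.0


def unsupported_rate(statements: List[Statement]) -> float:
    """Share of statements that no listed source supports (a rough hallucination proxy)."""
    if not statements:
        return 0.0
    unsupported = sum(1 for s in statements if not s.supported_by)
    return unsupported / len(statements)


# Toy annotated answer: three statements over sources 1, 2, and 3.
answer = [
    Statement("Claim A", cited={1, 2}, supported_by={1}),
    Statement("Claim B", cited={3}, supported_by=set()),
    Statement("Claim C", cited=set(), supported_by={2}),
]

accuracy = citation_accuracy(answer)
unsupported = unsupported_rate(answer)
```

In this toy answer, one of the three citations points at a supporting source and one of the three statements has no support at all, so both scores come out to one third. The support judgments are the hard part in practice; here they are simply given as input.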