CL | Nov 26, 2024

The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations

Theodora Worledge, Tatsunori Hashimoto, Carlos Guestrin

arXiv:2411.17375v16.112 citations | h-index: 19 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses unreliable citation in LLMs, which matters to users who need trustworthy information in high-stakes domains. Its contribution is incremental: it explores intermediate points on a spectrum between search engines and LLMs.

The paper examines the trade-off between verifiability and utility in information-sharing tools such as LLMs and search engines. It finds that as outputs become more abstractive, perceived utility improves by up to 200%, while the proportion of properly cited sentences decreases by up to 50% and users take up to 3 times as long to verify cited information.

Across all fields of academic study, experts cite their sources when sharing information. While large language models (LLMs) excel at synthesizing information, they do not provide reliable citation to sources, making it difficult to trace and verify the origins of the information they present. In contrast, search engines make sources readily accessible to users and place the burden of synthesizing information on the user. Through a survey, we find that users prefer search engines over LLMs for high-stakes queries, where concerns regarding information provenance outweigh the perceived utility of LLM responses. To examine the interplay between verifiability and utility of information-sharing tools, we introduce the extractive-abstractive spectrum, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points. Search engines are extractive because they respond to queries with snippets of sources with links (citations) to the original webpages. LLMs are abstractive because they address queries with answers that synthesize and logically transform relevant information from training and in-context sources without reliable citation. We define five operating points that span the extractive-abstractive spectrum and conduct human evaluations on seven systems across four diverse query distributions that reflect real-world QA settings: web search, language simplification, multi-step reasoning, and medical advice. As outputs become more abstractive, we find that perceived utility improves by as much as 200%, while the proportion of properly cited sentences decreases by as much as 50% and users take up to 3 times as long to verify cited information. Our findings recommend distinct operating points for domain-specific LLM systems and our failure analysis informs approaches to high-utility LLM systems that empower users to verify information.
