HC · Mar 12

To Believe or Not To Believe: Comparing Supporting Information Tools to Aid Human Judgments of AI Veracity

arXiv:2603.11393v1 · 13.0 · h-index: 5
Predicted impact: top 23% in HC · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses the risk that AI hallucinations pose to users in sectors such as biomedical research and law, who must assess AI-generated information for themselves. The contribution is incremental, building on prior work on supporting information tools.

The study compared three supporting information tools (full source text, passage retrieval, and LLM explanations) to help users judge the veracity of AI-generated answers in a data extraction context, finding that passage retrieval balanced accuracy and speed while LLM explanations led to inappropriate reliance and reduced error detection.

With increasing awareness of the hallucination risks of generative artificial intelligence (AI), we see a growing shift toward providing information tooling to help users determine the veracity of AI-generated answers for themselves. User responsibility for assessing veracity is particularly critical for certain sectors that rely on on-demand, AI-generated data extraction, such as biomedical research and the legal sector. While prior work offers us a variety of ways in which systems can provide such support, there is a lack of empirical evidence on how this information is actually incorporated into the user's decision-making process. Our user study takes a step toward filling this knowledge gap. In the context of a generative AI data extraction tool, we examine the relationship between the type of supporting information (full source text, passage retrieval, and Large Language Model (LLM) explanations) and user behavior in the veracity assessment process, examined through the lens of efficiency, effectiveness, reliance, and trust. We find that passage retrieval offers a reasonable compromise between accuracy and speed, with judgments of veracity comparable to using the full source text. LLM explanations, while also enabling rapid assessments, fostered inappropriate reliance on and trust in the data extraction AI, such that participants were less likely to detect errors. In addition, we analyzed the impacts of the complexity of the information need, finding preliminary evidence that inappropriate reliance is worse for complex answers. We demonstrate how, through rigorous user evaluation, we can better develop systems that allow for effective and responsible human agency in veracity assessment processes.

