Delusions of Large Language Models
This work addresses the problem of unreliable AI outputs for users of large language models, though it appears incremental because it focuses on a specific subtype of hallucination.
The paper identifies and studies LLM delusions, defined as factually incorrect outputs produced with abnormally high confidence, which makes them harder to detect than ordinary hallucinations. Through empirical analysis on Question Answering tasks, the authors show that delusions are prevalent, distinct from hallucinations, and harder to mitigate via finetuning or self-reflection.
Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high-belief hallucinations: incorrect outputs with abnormally high confidence, which makes them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self-reflection. We link delusion formation with training dynamics and dataset noise, and explore mitigation strategies such as retrieval-augmented generation and multi-agent debate. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.
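The distinction drawn above between ordinary hallucinations and delusions can be sketched as a simple labelling rule. This is only an illustration under stated assumptions: the `Answer` record, the `categorize` and `delusion_rate` helpers, and the fixed 0.9 confidence threshold are all hypothetical and are not taken from the paper, which does not state its exact criterion for "abnormally high" confidence in the text above.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    """One model answer to a QA item (hypothetical record, not from the paper)."""
    question: str
    correct: bool        # whether the answer matches the reference
    confidence: float    # model's belief in its own answer, in [0, 1]


def categorize(answer: Answer, threshold: float = 0.9) -> str:
    """Label an answer as correct, hallucination, or delusion.

    The threshold is an assumed cutoff for "abnormally high" confidence.
    """
    if answer.correct:
        return "correct"
    # Incorrect answers split by how strongly the model believes them.
    return "delusion" if answer.confidence >= threshold else "hallucination"


def delusion_rate(answers: list[Answer], threshold: float = 0.9) -> float:
    """Fraction of incorrect answers that are held with high confidence."""
    wrong = [a for a in answers if not a.correct]
    if not wrong:
        return 0.0
    return sum(a.confidence >= threshold for a in wrong) / len(wrong)


if __name__ == "__main__":
    sample = [
        Answer("q1", correct=True, confidence=0.95),
        Answer("q2", correct=False, confidence=0.97),
        Answer("q3", correct=False, confidence=0.40),
    ]
    for a in sample:
        print(a.question, categorize(a))
    print("delusion rate among errors:", delusion_rate(sample))
```

A fixed cutoff is the simplest possible choice here; a threshold calibrated per model or per dataset would likely be more appropriate in practice, since confidence scales differ across model families.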