The Trilemma of Truth in Large Language Models
This work addresses the challenge of assessing what LLMs encode as true, which matters for their reliability in applications such as information retrieval and fact-checking. The contribution is incremental in that it builds on existing probing techniques.
The study examines how to probe the veracity of knowledge in large language models (LLMs). It identifies flaws in existing probing methods and introduces sAwMIL, a multiclass probing framework that classifies statements as true, false, or neither. The results show that common methods fail to provide reliable veracity directions and that truth and falsehood are not encoded symmetrically.
The public often attributes human-like qualities to large language models (LLMs) and assumes they "know" certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform worse than zero-shot prompting; (2) truth and falsehood are not encoded symmetrically; and (3) LLMs encode a third type of signal that is distinct from both true and false.
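To make the combination of multiple-instance learning and conformal prediction concrete, the following is a minimal sketch and not the sAwMIL implementation. It assumes that a probe has already produced per-token scores for each of the three labels, that a statement (the bag) is scored by max pooling over its tokens (the instances), and that the nonconformity score is one minus the score of the gold label. The abstract specifies none of these choices. All function names (`bag_scores`, `conformal_threshold`, `prediction_set`) are hypothetical.

```python
import math

LABELS = ("true", "false", "neither")


def bag_scores(token_scores):
    """MIL-style pooling (assumed): per label, take the max score over tokens.

    token_scores is a list of dicts, one per token, mapping label -> score.
    """
    return {label: max(tok[label] for tok in token_scores) for label in LABELS}


def conformal_threshold(calibration, alpha):
    """Split conformal calibration.

    calibration is a list of (scores, gold_label) pairs from held-out data.
    Returns the threshold q such that, under exchangeability, the prediction
    set contains the gold label with probability at least 1 - alpha.
    """
    # Nonconformity (assumed form): 1 - score assigned to the gold label.
    nonconformity = sorted(1.0 - scores[gold] for scores, gold in calibration)
    n = len(nonconformity)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: keep every label.
        return float("inf")
    return nonconformity[k - 1]


def prediction_set(scores, q):
    """Return every label whose nonconformity is within the threshold."""
    return [label for label in LABELS if 1.0 - scores[label] <= q]


# Hypothetical calibration data: (bag-level scores, gold label).
calibration = [
    ({"true": 0.75, "false": 0.125, "neither": 0.125}, "true"),
    ({"true": 0.25, "false": 0.5, "neither": 0.25}, "false"),
    ({"true": 0.5, "false": 0.25, "neither": 0.25}, "neither"),
]
q = conformal_threshold(calibration, alpha=0.5)

# A statement with two tokens, pooled into one bag-level score per label.
statement = bag_scores([
    {"true": 0.25, "false": 0.5, "neither": 0.125},
    {"true": 0.875, "false": 0.0625, "neither": 0.0625},
])
print(prediction_set(statement, q))
```

A prediction set with a single label corresponds to a confident classification. A set with several labels, or none, signals that the probe cannot separate the classes at the chosen error level `alpha`.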