CL · Feb 6, 2024

INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

arXiv:2402.03744v2 · 306 citations · h-index: 10 · ICLR
AI Analysis

This work addresses the reliability of deployed LLMs by offering a more effective hallucination detection method, though the contribution is incremental because it builds on existing self-consistency approaches.

The paper tackles the detection of knowledge hallucinations in large language models (LLMs). It proposes INSIDE, a method that exploits the semantic information retained in the model's internal states, and reports improved detection performance in experiments on QA benchmarks.

Knowledge hallucinations have raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts to detect hallucinations have operated at the level of logit-based uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' INternal States for hallucInation DEtection (INSIDE). In particular, a simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency; it exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.
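To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each of K sampled responses has already been reduced to a single d-dimensional embedding taken from the model's internal states. The function names (`eigenscore`, `clip_features`), the regularizer `alpha`, the centering choice, the percentile defaults, and the use of a reference bank of activations are all illustrative assumptions rather than details stated above.

```python
import numpy as np


def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Mean log-eigenvalue of a regularized response covariance matrix.

    embeddings: array of shape (K, d), one row per sampled response.
    alpha: small ridge term (assumed value) that keeps eigenvalues positive.
    A higher score means the responses are more spread out in embedding space.
    """
    k = embeddings.shape[0]
    # Center each response embedding across its feature dimension (assumption).
    centered = embeddings - embeddings.mean(axis=1, keepdims=True)
    # K x K covariance (Gram) matrix between responses.
    cov = centered @ centered.T
    # eigvalsh is appropriate because the matrix is symmetric.
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))
    # Guard against tiny negative values caused by floating-point error.
    eigvals = np.clip(eigvals, 1e-12, None)
    return float(np.mean(np.log(eigvals)))


def clip_features(hidden: np.ndarray, bank: np.ndarray,
                  low_pct: float = 0.2, high_pct: float = 99.8) -> np.ndarray:
    """Truncate extreme activations to percentile thresholds.

    hidden: activations to clip at test time.
    bank: reference activations used to estimate the thresholds (hypothetical).
    low_pct, high_pct: assumed percentile cut-offs.
    """
    lo, hi = np.percentile(bank, [low_pct, high_pct])
    return np.clip(hidden, lo, hi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=64)
    # Near-identical embeddings stand in for semantically consistent answers.
    consistent = np.tile(base, (5, 1)) + 1e-3 * rng.normal(size=(5, 64))
    # Unrelated embeddings stand in for divergent answers.
    diverse = rng.normal(size=(5, 64))
    print("consistent:", eigenscore(consistent))
    print("diverse:   ", eigenscore(diverse))
```

In this sketch, near-identical embeddings yield a covariance matrix that is close to rank one, so most eigenvalues collapse toward `alpha` and the score is low; divergent embeddings spread the eigenvalues out and raise the score. A detector would then flag a question as likely hallucinated when the score exceeds a threshold chosen on held-out data.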

Code Implementations: 1 repo
