Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge
This work addresses LLM hallucination, a critical obstacle to practical applications, by providing a more accurate detection method that requires no external knowledge or supervised training.
The paper introduces hallucination probing, a new task that classifies LLM-generated text as aligned, misaligned, or fabricated. It also proposes SHINE, a method that achieves state-of-the-art performance in hallucination detection, outperforming seven competing methods across four datasets and four LLMs.
LLM hallucination, where unfaithful text is generated, presents a critical challenge for LLMs' practical applications. Current detection methods often resort to external knowledge, LLM fine-tuning, or supervised training with large hallucination-labeled datasets. Moreover, these approaches do not distinguish between different types of hallucinations, a distinction that is crucial for enhancing detection performance. To address these limitations, we introduce hallucination probing, a new task that classifies LLM-generated text into three categories: aligned, misaligned, and fabricated. Driven by our novel discovery that perturbing key entities in prompts affects an LLM's generation of these three types of text differently, we propose SHINE, a novel hallucination probing method that does not require external knowledge, supervised training, or LLM fine-tuning. SHINE is effective in hallucination probing across three modern LLMs and achieves state-of-the-art performance in hallucination detection, outperforming seven competing methods across four datasets and four LLMs, underscoring the importance of probing for accurate detection.
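
The perturbation idea in the abstract can be illustrated with a minimal sketch. This is not SHINE's actual algorithm, which the abstract does not specify. It only shows one plausible way to measure how much the likelihood of a generated text changes when a key entity in the prompt is swapped. The names `Scorer`, `perturb_prompt`, `perturbation_sensitivity`, and `toy_score` are hypothetical, and the toy scorer is a hard-coded stand-in for a real LLM likelihood.

```python
from typing import Callable, List

# Hypothetical scorer type: (prompt, generated_text) -> log-likelihood of the text.
# In practice this would come from an LLM; here it is an assumption.
Scorer = Callable[[str, str], float]


def perturb_prompt(prompt: str, entity: str, replacement: str) -> str:
    """Swap one key entity in the prompt for a replacement entity."""
    return prompt.replace(entity, replacement)


def perturbation_sensitivity(
    prompt: str,
    generated: str,
    entity: str,
    replacements: List[str],
    score: Scorer,
) -> float:
    """Mean drop in log-likelihood of `generated` when `entity` is perturbed."""
    if not replacements:
        raise ValueError("at least one replacement entity is required")
    base = score(prompt, generated)
    drops = [
        base - score(perturb_prompt(prompt, entity, r), generated)
        for r in replacements
    ]
    return sum(drops) / len(drops)


def toy_score(prompt: str, generated: str) -> float:
    """Toy stand-in for an LLM: 'Paris' is likely only when the prompt names France."""
    if "France" in prompt and "Paris" in generated:
        return -1.0
    return -5.0


prompt = "What is the capital of France?"
grounded = perturbation_sensitivity(
    prompt, "Paris", "France", ["Germany", "Japan"], toy_score
)
ungrounded = perturbation_sensitivity(
    prompt, "Zorbville", "France", ["Germany", "Japan"], toy_score
)
print(grounded, ungrounded)
```

In this toy setup, the text tied to the prompt entity reacts strongly to the perturbation, while the unrelated text does not react at all. How SHINE turns such signals into the aligned, misaligned, and fabricated labels is left to the paper itself.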