PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
Provides a fine-grained diagnostic framework for understanding where and why hallucinations occur in LLMs, addressing a gap in existing benchmarks that only offer output-level severity scores.
PRISM introduces a diagnostic benchmark that disentangles LLM hallucinations into four dimensions (knowledge missing, knowledge errors, reasoning errors, instruction-following errors) across 9,448 instances and 65 tasks. Evaluation of 24 LLMs reveals trade-offs where mitigation strategies improve specific dimensions at the expense of others.
As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior, output-level scoring, which quantifies hallucination severity but offers limited insight into where and why hallucinations arise in the generation pipeline. We therefore reformulate hallucination evaluation as a diagnostic problem and propose PRISM, a controlled benchmark that disentangles hallucinations into four dimensions (knowledge missing, knowledge errors, reasoning errors, and instruction-following errors), grounded in three stages of generation: memory, instruction, and reasoning. PRISM contains 9,448 instances across 65 tasks and supports fine-grained, stage-aware diagnostic evaluation. Evaluating 24 mainstream open-source and proprietary LLMs, we uncover consistent trade-offs across instruction following, memory retrieval, and logical reasoning, showing that mitigation strategies often improve specific dimensions at the expense of others. We hope PRISM provides a framework for understanding the specific mechanisms behind LLM hallucinations, ultimately accelerating the development of trustworthy large language models.
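To illustrate the difference between output-level scoring and the stage-aware diagnosis described above, the following is a minimal sketch, not PRISM's actual evaluation code. The record format, the `diagnose` function, the dimension identifiers, and the mapping from dimensions to stages are all assumptions made for illustration, and the records are toy values rather than benchmark results.

```python
from collections import defaultdict

# Assumed mapping of the four dimensions onto the three generation stages;
# the abstract names both sets but does not spell out this correspondence.
DIMENSION_TO_STAGE = {
    "knowledge_missing": "memory",
    "knowledge_error": "memory",
    "reasoning_error": "reasoning",
    "instruction_following_error": "instruction",
}


def diagnose(records):
    """Return (overall_rate, per_dimension_rates, per_stage_rates).

    `records` is an iterable of (dimension, hallucinated) pairs, where
    `hallucinated` is True when the model's answer was judged a hallucination.
    """
    dim_total, dim_err = defaultdict(int), defaultdict(int)
    stage_total, stage_err = defaultdict(int), defaultdict(int)
    for dimension, hallucinated in records:
        stage = DIMENSION_TO_STAGE[dimension]  # KeyError on unknown dimension
        dim_total[dimension] += 1
        dim_err[dimension] += int(hallucinated)
        stage_total[stage] += 1
        stage_err[stage] += int(hallucinated)

    overall = sum(dim_err.values()) / sum(dim_total.values())
    per_dimension = {d: dim_err[d] / dim_total[d] for d in dim_total}
    per_stage = {s: stage_err[s] / stage_total[s] for s in stage_total}
    return overall, per_dimension, per_stage


# Toy, hand-written records (hypothetical, not drawn from PRISM).
toy_records = [
    ("knowledge_missing", True),
    ("knowledge_missing", False),
    ("knowledge_error", False),
    ("knowledge_error", False),
    ("reasoning_error", True),
    ("reasoning_error", True),
    ("instruction_following_error", False),
    ("instruction_following_error", True),
]

overall, per_dimension, per_stage = diagnose(toy_records)
print(overall)        # single output-level score
print(per_dimension)  # the same errors, split by dimension
print(per_stage)      # the same errors, split by generation stage
```

In this toy case the single overall rate hides the fact that every reasoning item fails while no knowledge-error item does, which is the kind of signal a per-dimension breakdown is meant to surface.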