Probing for Knowledge Attribution in Large Language Models
This work offers a method for identifying the source of knowledge behind LLM outputs. Knowing that source helps developers and users diagnose and mitigate hallucinations, in particular by distinguishing faithfulness violations from factuality violations.
This paper addresses the problem of identifying whether a large language model's output is based on the input prompt or on its internal knowledge, a crucial step for mitigating hallucinations. The authors develop a linear classifier probe that, when trained on their self-supervised AttriWiki dataset, reliably predicts the dominant knowledge source. The probe achieves up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, and transfers to out-of-domain benchmarks with 0.94-0.99 Macro-F1.
Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations, which misuse the user context, and (ii) factuality violations, which are errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or on its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. To train it, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, and transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
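To make the idea of a linear probe concrete, the sketch below trains a logistic-regression classifier on synthetic vectors and scores it with Macro-F1. It is an illustration under stated assumptions, not the authors' implementation: the random Gaussian vectors merely stand in for hidden representations extracted from an LLM, the label convention (1 = answer read from context, 0 = answer recalled from memory) is chosen here for illustration, and all function names are hypothetical.

```python
import numpy as np


def make_synthetic_hidden_states(n_per_class=200, dim=32, shift=3.0, seed=0):
    """Assumption: stand-in for real LLM hidden states.

    Two Gaussian clusters separated along one random unit direction.
    Label 1 = "contextual" (read from prompt), 0 = "parametric" (recalled).
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    contextual = rng.normal(size=(n_per_class, dim)) + shift * direction
    parametric = rng.normal(size=(n_per_class, dim)) - shift * direction
    X = np.vstack([contextual, parametric])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    perm = rng.permutation(len(y))  # shuffle so a slice is a fair split
    return X[perm], y[perm]


def train_linear_probe(X, y, lr=0.1, epochs=300, l2=1e-3):
    """Fit logistic regression with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def predict(X, w, b):
    """Return hard 0/1 labels from the linear decision boundary."""
    return (X @ w + b > 0).astype(int)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in (0, 1):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


X, y = make_synthetic_hidden_states()
w, b = train_linear_probe(X[:300], y[:300])           # train on 300 examples
score = macro_f1(y[300:], predict(X[300:], w, b))     # evaluate on held-out 100
print(f"held-out Macro-F1: {score:.3f}")
```

In a real setting the rows of `X` would be hidden-state vectors taken from a chosen layer and token position of the model, and `y` would come from a labelling pipeline such as AttriWiki; the score obtained on this synthetic data says nothing about the results reported above.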