Why Fine-Tuning Encourages Hallucinations and How to Fix It
For practitioners fine-tuning LLMs, this work provides a practical method to reduce hallucinations while learning new facts, addressing a critical reliability issue.
Supervised fine-tuning (SFT) of large language models on new factual information can increase hallucinations by degrading knowledge acquired during pre-training. The authors propose a self-distillation-based SFT method that reduces these hallucinations by regularizing output-distribution drift, and show that, when new knowledge is unnecessary, freezing parameter groups preserves task performance while reducing hallucinations.
Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations with respect to knowledge acquired during pre-training. Because SFT-induced hallucinations arise as a by-product of knowledge degradation during training, we explore whether they can be mitigated using established tools from the continual learning literature. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations with respect to pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
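To make "regularizing output-distribution drift" concrete, the sketch below shows one common form of a self-distillation objective: the usual SFT cross-entropy on the new target token, plus a KL penalty that keeps the fine-tuned (student) next-token distribution close to that of a frozen copy of the pre-fine-tuning model (teacher). The abstract does not specify the exact objective, so the KL direction, the `drift_weight` coefficient, and all function names here are illustrative assumptions rather than the authors' formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(student_logits, target_index):
    """Standard SFT loss: negative log-likelihood of the target token."""
    return -log_softmax(student_logits)[target_index]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two next-token distributions.

    Assumption: the teacher is a frozen copy of the model taken before
    fine-tuning, so this term measures output-distribution drift.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


def self_distillation_loss(student_logits, teacher_logits, target_index,
                           drift_weight=1.0):
    """SFT cross-entropy plus a weighted drift penalty.

    drift_weight is a hypothetical hyperparameter: 0 recovers plain SFT,
    larger values hold the student closer to the pre-trained model.
    """
    sft_term = cross_entropy(student_logits, target_index)
    drift_term = kl_divergence(teacher_logits, student_logits)
    return sft_term + drift_weight * drift_term
```

With identical student and teacher logits the drift term vanishes and the loss reduces to plain SFT; as the student's distribution moves away from the teacher's, the penalty grows. In a real training loop these quantities would be computed per token over batches with a tensor library, with gradients flowing only through the student.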