CL · AI · Mar 27

Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

arXiv:2604.19765 · 86.0 · h-index: 2

Predicted impact: top 48% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This is an incremental finding that impacts developers of neuron-level hallucination detectors by showing they need domain-specific calibration.

The paper asks whether hallucination neurons in large language models generalize across knowledge domains and finds that they do not: classifiers achieve AUROC 0.783 within-domain but only 0.563 when transferred across domains.

Recent work identifies a sparse set of "hallucination neurons" (H-neurons), less than 0.1% of feed-forward network neurons, that reliably predict when large language models will hallucinate. These neurons are identified on general-knowledge question answering and shown to generalize to new evaluation instances. We ask a natural follow-up question: do H-neurons generalize across knowledge domains? Using a systematic cross-domain transfer protocol across 6 domains (general QA, legal, financial, science, moral reasoning, and code vulnerability) and 5 open-weight models (3B to 8B parameters), we find they do not. Classifiers trained on one domain's H-neurons achieve AUROC 0.783 within-domain but only 0.563 when transferred to a different domain (delta = 0.220, p < 0.001), a degradation consistent across all models tested. Our results suggest that hallucination is not a single mechanism with a universal neural signature, but rather involves domain-specific neuron populations that differ depending on the knowledge type being queried. This finding has direct implications for the deployment of neuron-level hallucination detectors, which must be calibrated per domain rather than trained once and applied universally.
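The cross-domain transfer protocol described in the abstract can be illustrated with a minimal sketch: select a small neuron subset on a source domain, fit a classifier on it, then score every target domain and compare the diagonal (within-domain) of the resulting AUROC matrix with the off-diagonal (cross-domain) entries. Everything below is an assumption made for illustration and not the paper's implementation: the activations are synthetic Gaussian data, the three domain names are placeholders, neuron selection uses a simple mean-difference ranking, and the classifier is a plain logistic regression fitted by gradient descent.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores share an average rank."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        mask = scores == value
        ranks[mask] = ranks[mask].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def make_domain(rng, signal_idx, n=400, n_neurons=50):
    """Synthetic activations: only `signal_idx` neurons shift when label == 1."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_neurons))
    X[:, signal_idx] += y[:, None]
    return X, y


def select_neurons(X, y, k):
    """Hypothetical selection rule: top-k neurons by absolute class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]


def fit_logistic(X, y, lr=0.1, steps=500):
    """Logistic regression fitted with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def transfer_matrix(train, test, k=3):
    """Entry [i, j]: AUROC of a classifier trained on domain i, scored on domain j."""
    names = list(train)
    M = np.zeros((len(names), len(names)))
    for i, src in enumerate(names):
        X_tr, y_tr = train[src]
        idx = select_neurons(X_tr, y_tr, k)      # neurons chosen on the source only
        w, b = fit_logistic(X_tr[:, idx], y_tr)
        for j, tgt in enumerate(names):
            X_te, y_te = test[tgt]
            M[i, j] = auroc(y_te, X_te[:, idx] @ w + b)
    return names, M


def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: each placeholder domain has its own disjoint set of signal neurons.
    signal = {"general_qa": [0, 1, 2], "legal": [10, 11, 12], "code": [20, 21, 22]}
    train = {d: make_domain(rng, s) for d, s in signal.items()}
    test = {d: make_domain(rng, s) for d, s in signal.items()}
    _, M = transfer_matrix(train, test)
    off_diag = ~np.eye(len(M), dtype=bool)
    return M.diagonal().mean(), M[off_diag].mean()


within, cross = run_demo()
print(f"within-domain AUROC: {within:.3f}, cross-domain AUROC: {cross:.3f}")
```

Because the synthetic domains are built with disjoint signal neurons, the gap between the two averages here is designed in rather than discovered; the sketch only shows how the within-domain versus cross-domain comparison is computed, not why real models behave this way.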

