Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
This work addresses efficiency bottlenecks in hallucination detection for LLM applications, offering a domain-agnostic improvement that is incremental but practical.
The paper tackles the high computational cost of self-consistency methods for hallucination detection in LLMs. It identifies redundancy in the form of prefix tokens shared across generations and proposes a Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding, achieving up to a 3x speedup without sacrificing AUROC.
Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
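To make the redundancy observation concrete, the following minimal sketch estimates how many tokens in a set of sampled responses merely repeat a prefix already produced by an earlier response. It is an illustration only and not the authors' DMP implementation: the function names, the integer token IDs, and the toy responses are all assumptions introduced here.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return the length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_token_fraction(responses: List[Sequence[int]]) -> float:
    """Fraction of all generated tokens that repeat a prefix of an earlier response.

    For every response after the first, take the longest prefix it shares
    with any earlier response. Those tokens are candidates for reuse
    (for example via cached states) instead of being recomputed.
    """
    total = sum(len(r) for r in responses)
    if total == 0:
        return 0.0
    reused = 0
    for i in range(1, len(responses)):
        reused += max(
            shared_prefix_len(responses[i], prev) for prev in responses[:i]
        )
    return reused / total


# Hypothetical token IDs for three sampled responses to the same prompt.
samples = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [1, 2, 6, 7],
]
fraction = reusable_token_fraction(samples)
print(f"reusable fraction: {fraction:.3f}")
```

In this toy case the second response shares three leading tokens with the first and the third shares two, so 5 of the 12 generated tokens are redundant. How the actual pipeline detects and exploits such overlap (selective inference, annealed decoding) is described in the paper and is not reproduced here.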