CL | May 19, 2025

Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down

arXiv:2505.12969v1 | 2 citations | h-index: 31 | INTERSPEECH
Originality: Incremental advance
AI Analysis

The work addresses a critical limitation for deploying Whisper in complex industrial settings, where non-speech artifacts are common.

The paper tackles Whisper's hallucination on non-speech segments by identifying and fine-tuning three problematic self-attention heads in the decoder, achieving an over 80% reduction in hallucinations with less than 0.1% WER degradation on LibriSpeech.

OpenAI's Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to hallucinate, particularly on non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper's hallucination on non-speech segments without using any pre- or post-processing techniques. Specifically, we benchmark the contribution of each self-attention head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three "crazy" heads on a collection of non-speech data. The results show that our best fine-tuned model, Calm-Whisper, achieves an over 80% reduction in non-speech hallucination with less than 0.1% WER degradation on LibriSpeech test-clean and test-other.
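The head-wise masking and ranking procedure described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names (`mask_heads`, `hallucination_rate`, `rank_heads`), the toy numbers, and the rule that any non-empty transcript on a non-speech clip counts as a hallucination are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def mask_heads(head_outputs: Sequence[Sequence[float]], masked: Set[int]) -> List[float]:
    """Concatenate per-head attention outputs, zeroing every head listed in `masked`."""
    merged: List[float] = []
    for idx, out in enumerate(head_outputs):
        scale = 0.0 if idx in masked else 1.0
        merged.extend(scale * v for v in out)
    return merged


def hallucination_rate(transcripts: Sequence[str]) -> float:
    """Fraction of non-speech clips for which the model emitted any text.

    Assumption: on non-speech audio, any non-empty output is a hallucination.
    """
    hallucinated = sum(1 for t in transcripts if t.strip())
    return hallucinated / len(transcripts)


def rank_heads(baseline_rate: float,
               rate_with_head_masked: Dict[int, float]) -> List[Tuple[int, float]]:
    """Rank heads by how much the hallucination rate drops when each is masked alone."""
    drops = [(head, baseline_rate - rate) for head, rate in rate_with_head_masked.items()]
    return sorted(drops, key=lambda item: item[1], reverse=True)


# Toy example with three heads (hypothetical values, not results from the paper).
toy_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merged = mask_heads(toy_outputs, masked={1})  # head 1 contributes zeros

# Hypothetical transcripts of four non-speech clips with no head masked.
baseline = hallucination_rate(["Thank you.", "", "Bye.", "Thanks for watching."])

# Hypothetical hallucination rates measured with one head masked at a time.
toy_rates = {0: 0.75, 1: 0.25, 2: 0.5}
ranking = rank_heads(baseline, toy_rates)
heads_to_finetune = [head for head, _ in ranking[:1]]
```

In this toy run, masking head 1 removes the most hallucinations, so it would be the candidate for targeted fine-tuning. The paper applies the same idea to the 20 decoder self-attention heads and selects three.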

