SDAIASJan 20, 2025

Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio

arXiv:2501.11378v1 | 25 citations | h-index: 6 | ICASSP
Originality: Synthesis-oriented
AI Analysis

This paper addresses hallucinations in ASR, a practical concern for users of speech recognition, but the contribution is incremental because it focuses on post-processing for a single model (Whisper).

The paper tackles hallucinations in the Whisper ASR model induced by non-speech audio and shows that post-processing transcriptions with a bag of hallucinations (BoH) reduces word error rate (WER) and serves as a safeguard.

Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by the augmentation of speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that makes it possible to remove the effect of hallucinations through the post-processing of text transcriptions. The results of our experiments show that such post-processing is capable of reducing word error rate (WER) and acts as a good safeguard against problematic hallucinations.
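
The abstract describes the BoH only at a high level, so the following is a minimal sketch of what phrase-based post-processing of a transcription could look like, not the authors' implementation. The bag entries, the normalization step, and the function names `_normalize` and `remove_hallucinations` are all assumptions made for illustration.

```python
import re

# Hypothetical bag entries for illustration only; not taken from the paper.
BAG_OF_HALLUCINATIONS = [
    "thanks for watching",
    "subtitles by the community",
]


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def remove_hallucinations(transcript: str, bag=BAG_OF_HALLUCINATIONS) -> str:
    """Return the normalized transcript with every bag phrase deleted.

    Assumption: matching is done on whole words after normalization,
    which is one plausible choice among several.
    """
    words = _normalize(transcript).split()
    # Try longer phrases first so a short entry cannot split a longer match.
    phrases = sorted((_normalize(p).split() for p in bag), key=len, reverse=True)
    kept = []
    i = 0
    while i < len(words):
        for phrase in phrases:
            # Skip empty phrases; they would match everywhere and never advance.
            if phrase and words[i:i + len(phrase)] == phrase:
                i += len(phrase)  # drop the matched phrase
                break
        else:
            kept.append(words[i])
            i += 1
    return " ".join(kept)


cleaned = remove_hallucinations("Turn left at the corner. Thanks for watching!")
print(cleaned)
```

Because the sketch returns normalized text (lowercased, punctuation removed), it fits best in a pipeline where WER is computed on normalized references as well. A production filter would likely need to preserve the original casing and punctuation.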

