Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio
This work addresses ASR hallucinations, a practical concern for users of speech recognition systems, but the contribution is incremental because it is limited to post-processing for a single model.
The paper tackles hallucinations that the Whisper ASR model produces when given non-speech audio, and shows that post-processing transcriptions with a bag of hallucinations reduces word error rate and serves as a safeguard.
Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by the augmentation of speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that makes it possible to remove the effect of hallucinations through the post-processing of text transcriptions. The results of our experiments show that such post-processing is capable of reducing word error rate (WER) and acts as a good safeguard against problematic hallucinations.
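The BoH post-processing idea from the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the phrases in `BAG_OF_HALLUCINATIONS`, the `normalize` step, and the function names `remove_hallucinations` and `wer` are all assumptions made for illustration, and the paper's actual bag and matching procedure may differ. The sketch strips known hallucinated phrases from a transcript and compares WER before and after.

```python
import re

# Hypothetical bag of hallucinations: these phrases are illustrative
# assumptions, not the list constructed in the paper.
BAG_OF_HALLUCINATIONS = [
    "thank you for watching",
    "thanks for watching",
    "please subscribe",
]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def remove_hallucinations(transcript: str, bag=BAG_OF_HALLUCINATIONS) -> str:
    """Delete every bag phrase that occurs in the normalized transcript."""
    cleaned = normalize(transcript)
    # Match longer phrases first so a short entry cannot split a longer one.
    for phrase in sorted((normalize(p) for p in bag), key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned)
    return " ".join(cleaned.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    reference = "turn on the lights"
    raw = "Turn on the lights. Thank you for watching."
    cleaned = remove_hallucinations(raw)
    print(cleaned)
    print(wer(reference, raw), wer(reference, cleaned))
```

Exact phrase matching of this kind would also delete a bag phrase that was genuinely spoken, so a practical system would likely need additional evidence (for example, voice activity detection on the corresponding audio segment) before removing text.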