SD | AI | AS | Jan 20, 2025

Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio

Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Stanisław Kacprzak, Marcin Witkowski, Konrad Kowalczyk

arXiv:2501.11378v1 | 26.725 citations | h-index: 6 | ICASSP

Originality: Synthesis-oriented

AI Analysis

This paper addresses ASR hallucinations that affect speech recognition users, but the contribution is incremental because it focuses on post-processing for a single model.

The paper tackles hallucinations in the Whisper ASR model induced by non-speech audio and shows that post-processing transcriptions with a bag of hallucinations (BoH) reduces word error rate and serves as a safeguard.

Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by the augmentation of speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that makes it possible to remove the effect of hallucinations through post-processing of text transcriptions. The results of our experiments show that such post-processing is capable of reducing word error rate (WER) and acts as a good safeguard against problematic hallucinations.
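
The abstract does not say how the BoH is matched against transcriptions, so the following is only a minimal sketch of the general idea, assuming exact phrase matching on normalized text. The names `EXAMPLE_BOH`, `normalize`, and `remove_hallucinations` are hypothetical, and the phrases in the bag are illustrative placeholders rather than entries from the paper.

```python
import re

# Illustrative placeholder phrases only; NOT the bag built in the paper.
EXAMPLE_BOH = [
    "thank you for watching",
    "thanks for watching",
]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def remove_hallucinations(transcript: str, bag: list[str]) -> str:
    """Remove every bag phrase that occurs as a whole-word span.

    Assumption: exact matching after normalization. The paper's actual
    matching rule may differ.
    """
    cleaned = normalize(transcript)
    # Longest phrases first, so a short entry cannot split a longer one.
    phrases = sorted((normalize(p) for p in bag), key=len, reverse=True)
    for phrase in phrases:
        if not phrase:
            continue
        # Lookarounds require whitespace or string edges around the phrase.
        pattern = r"(?<!\S)" + re.escape(phrase) + r"(?!\S)"
        cleaned = re.sub(pattern, " ", cleaned)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    raw = "Turn left at the corner. Thank you for watching!"
    print(remove_hallucinations(raw, EXAMPLE_BOH))
```

The returned text is lowercased and stripped of punctuation. That suits WER scoring, where reference and hypothesis are usually normalized first, but the original casing would need to be preserved separately if the cleaned transcript is shown to users.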
