Adopting Whisper for Confidence Estimation
This work proposes a method for word-level confidence estimation in speech recognition systems that improves out-of-domain robustness. The contribution is incremental in that it builds on the existing Whisper models.
The paper tackles word-level confidence estimation for speech recognition with an end-to-end approach that fine-tunes the Whisper model to generate confidence scores. The fine-tuned Whisper-tiny model performs on par with a strong Confidence Estimation Module (CEM) baseline on in-domain data and surpasses it on eight out-of-domain datasets, while the fine-tuned Whisper-large model outperforms the baseline by a substantial margin on all datasets.
Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.
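As a rough illustration of what a word-level confidence objective can look like, the sketch below derives 0/1 correctness labels for hypothesis words and scores a set of confidence values against them with binary cross-entropy. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `word_labels` and `binary_cross_entropy`, the use of Python's `difflib` as an approximate word alignment, the choice of loss, and the example scores. It does not involve Whisper or any audio input.

```python
import math
from difflib import SequenceMatcher


def word_labels(hypothesis, reference):
    """Return 1 for each hypothesis word aligned to an identical reference word, else 0.

    Assumption: difflib's matching blocks stand in for a proper
    minimum-edit-distance alignment, which they only approximate.
    """
    labels = [0] * len(hypothesis)
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    for block in matcher.get_matching_blocks():
        # block.b is the start index in the hypothesis, block.size the run length.
        for j in range(block.b, block.b + block.size):
            labels[j] = 1
    return labels


def binary_cross_entropy(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between confidence scores in [0, 1] and 0/1 labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


hyp = "the cat sat on a mat".split()
ref = "the cat sat on the mat".split()
labels = word_labels(hyp, ref)  # [1, 1, 1, 1, 0, 1]

# Hypothetical confidence scores, one per hypothesis word.
scores = [0.9, 0.9, 0.9, 0.9, 0.2, 0.8]
loss = binary_cross_entropy(scores, labels)
```

In this toy setup, a well-calibrated estimator would assign a low score to the substituted word ("a") and high scores elsewhere, which lowers the loss.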