ASLG, Feb 19, 2025

Adopting Whisper for Confidence Estimation

arXiv:2502.13446v1, 3 citations, h-index: 2, ICASSP
Originality: Incremental advance
AI Analysis

This work addresses word-level confidence estimation for speech recognition systems. Its method improves out-of-domain robustness, but the advance is incremental because it builds on existing Whisper models.

The paper tackles word-level confidence estimation for speech recognition with an end-to-end approach that fine-tunes the Whisper model to generate confidence scores. The fine-tuned Whisper-tiny model matches the CEM baseline on in-domain data and outperforms it on eight out-of-domain datasets, while the fine-tuned Whisper-large model outperforms the baseline by a substantial margin on all datasets.

Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate word-level confidence scores. Specifically, we introduce a method in which the Whisper model is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the fine-tuned Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the CEM baseline on eight out-of-domain datasets, whereas the fine-tuned Whisper-large model consistently outperforms the CEM baseline by a substantial margin across all datasets.
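The abstract does not say how the word-level training targets are built. A common choice in confidence estimation, assumed here purely for illustration, is to label each hypothesis word as correct (1) or incorrect (0) by aligning the hypothesis with a reference transcript. The minimal sketch below uses Python's difflib for the alignment; the function name word_correctness_labels and the example sentences are hypothetical and not taken from the paper.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_correctness_labels(reference: str, hypothesis: str) -> List[Tuple[str, int]]:
    """Label each hypothesis word 1 if it aligns with the reference, else 0.

    Assumption: binary correctness targets from a word-level alignment.
    The paper's actual labelling procedure is not given in the abstract.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # Start by assuming every hypothesis word is wrong.
    labels = [0] * len(hyp_words)

    # difflib finds matching word runs; this approximates, but is not
    # identical to, a minimum-edit-distance alignment.
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            labels[j] = 1

    return list(zip(hyp_words, labels))


pairs = word_correctness_labels(
    "the cat sat on the mat",  # reference transcript
    "the cat sat in the hat",  # ASR hypothesis
)
for word, label in pairs:
    print(word, label)
```

A confidence estimator trained on such targets would be expected to assign low scores to "in" and "hat" in this example, and high scores to the remaining words.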
