Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics
This work provides a benchmark for researchers working on robust ASR, but the contribution is incremental: it evaluates existing models rather than proposing new methods.
The authors address the evaluation of ASR robustness to room acoustics by introducing Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech. They find that reverberation degrades performance across all five Whisper models evaluated, with word error rate penalties ranging from 2.31 to 15.50 percentage points.
We introduce Whisper-RIR-Mega, a benchmark dataset of paired clean and reverberant speech for evaluating automatic speech recognition (ASR) robustness to room acoustics. Each sample pairs a clean LibriSpeech utterance with the same utterance convolved with a real room impulse response from the RIR-Mega corpus, with stratified splits by reverberation time (RT60) and direct-to-reverberant ratio (DRR). We evaluate five Whisper models (tiny through large-v3) on 1600 test samples and report word error rate (WER) and character error rate (CER) under clean and reverberant conditions. Reverberation consistently degrades performance across all model sizes; the reverb penalty in WER ranges from 2.31 to 15.50 percentage points depending on the model. Whisper-large-v3 shows the smallest penalty; Whisper-tiny shows the largest. We release the dataset, evaluation code, and baseline results to support reproducible research on robust ASR.
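The pairing and the reverb penalty described above can be illustrated with a minimal sketch. It is not the released evaluation code: the function names (`convolve`, `edit_distance`, `wer`, `reverb_penalty`) are hypothetical, WER is assumed to be computed at corpus level on whitespace-split words without text normalization, and no level normalization is applied after convolution. The benchmark may differ on any of these points.

```python
from typing import List, Sequence


def convolve(signal: Sequence[float], rir: Sequence[float]) -> List[float]:
    """Full linear convolution of a clean signal with a room impulse response.

    Assumption: plain convolution with no gain normalization afterwards.
    """
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def wer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level WER in percent: total word edits / total reference words.

    Assumes at least one reference word; no text normalization is applied.
    """
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words


def reverb_penalty(refs: Sequence[str],
                   clean_hyps: Sequence[str],
                   reverb_hyps: Sequence[str]) -> float:
    """Reverb penalty in percentage points: WER(reverberant) - WER(clean)."""
    return wer(refs, reverb_hyps) - wer(refs, clean_hyps)


if __name__ == "__main__":
    # Toy pairing: a two-sample "utterance" and a two-tap "RIR".
    reverberant = convolve([1.0, 2.0], [1.0, 0.5])
    print(reverberant)

    # Toy transcripts standing in for model output on clean and reverberant audio.
    refs = ["the cat sat on the mat"]
    clean_hyps = ["the cat sat on the mat"]
    reverb_hyps = ["the cat sat on a mat"]
    print(round(reverb_penalty(refs, clean_hyps, reverb_hyps), 2))
```

Because the penalty is a difference of two percentages, it is expressed in percentage points rather than as a relative change, which matches how the 2.31 to 15.50 range is reported. For real audio, an FFT-based convolution would be far faster than the nested loops shown here.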