Rethinking Evaluation in ASR: Are Our Models Robust Enough?
This work examines how robust ASR models are across datasets and highlights the limitations of evaluating on a single benchmark, a concern for both researchers and practitioners.
The paper investigates whether ASR models trained on a single benchmark generalize well to other datasets, finding that noise augmentation improves cross-domain performance and that average WER across multiple benchmarks correlates with real-world noisy data performance.
Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets, in particular whether models trained on a single dataset transfer to other (possibly out-of-domain) datasets. We show that, in general, reverberant and additive noise augmentation improves generalization performance across domains. Further, we demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data. Finally, we show that training a single acoustic model on the combination of the most widely used datasets reaches competitive performance on both research and real-world benchmarks.
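To make the "average WER over benchmarks" metric concrete, here is a minimal sketch in Python. It assumes WER is computed per benchmark as total word-level edit distance divided by total reference words, and that the average is an unweighted mean over benchmarks; the abstract does not specify the weighting, so that choice is an assumption. The function names and the toy benchmark names are hypothetical and for illustration only.

```python
from typing import Dict, List, Tuple

Pairs = List[Tuple[str, str]]  # (reference transcript, hypothesis transcript)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Pairs) -> float:
    """WER over one benchmark: total edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("benchmark has no reference words")
    return errors / words


def average_wer(benchmarks: Dict[str, Pairs]) -> Tuple[float, Dict[str, float]]:
    """Unweighted mean of per-benchmark WER (assumed weighting), plus the breakdown."""
    per_benchmark = {name: corpus_wer(pairs) for name, pairs in benchmarks.items()}
    return sum(per_benchmark.values()) / len(per_benchmark), per_benchmark


# Toy data only; these are not results from the paper.
toy_benchmarks = {
    "clean_read": [("the cat sat on the mat", "the cat sat on the mat")],
    "noisy_call": [("turn the lights off", "turn lights of")],
}
mean_wer, breakdown = average_wer(toy_benchmarks)
print(breakdown, mean_wer)
```

An unweighted mean gives each benchmark equal influence regardless of its size, so a small out-of-domain set counts as much as a large in-domain one. Weighting by reference word count would instead let the largest benchmark dominate the average.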