ASR4REAL: An extended benchmark for speech models
This work addresses the limited diversity of existing ASR benchmarks and gives researchers and developers a way to expose biases and weaknesses in current models. It is incremental in that it extends prior benchmarks rather than replacing them.
The authors introduce a set of benchmarks that evaluate speech models under real-life conditions. The results reveal significant performance drops linked to accent, socio-economic status, and conversational speech, with models showing up to 30% accuracy reduction in some cases.
Popular ASR benchmarks such as LibriSpeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that, although recent models do not seem to exhibit a gender bias, they usually show substantial performance discrepancies by accent, and even larger ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech. In this context, even a language model trained on a dataset as large as Common Crawl does not seem to have a significant positive effect, which reiterates the importance of developing conversational language models.
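
To make the notion of a performance discrepancy between speaker groups concrete, the following is a minimal sketch of how a per-group error rate could be computed. It assumes word error rate (WER) as the metric and simple whitespace tokenization. The function names, group labels, and sentences are hypothetical toy values chosen for illustration; they are not taken from the benchmark or its results.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Aggregate WER per group.

    samples: iterable of (group, reference, hypothesis) strings.
    Errors and reference word counts are pooled within each group,
    so longer utterances weigh more than shorter ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Toy, made-up data (hypothetical group labels and transcripts).
samples = [
    ("accent_a", "turn the lights off", "turn the lights off"),
    ("accent_b", "turn the lights off", "turn the light of"),
]
scores = wer_by_group(samples)
gap = max(scores.values()) - min(scores.values())
```

Comparing the resulting per-group scores (for example, through the gap between the best and worst group) is one simple way to quantify the kind of discrepancy described above; other aggregation choices, such as averaging per-utterance WER, are equally possible.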