CL SD AS
Jul 8, 2025

How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures

Tanvina Patel, Wiebke Hutiri, Aaron Yi Ding, Odette Scharenborg

arXiv:2507.05885v1
9.64 citations
h-index: 12

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of how to effectively measure performance and bias in ASR systems for researchers and practitioners, but it is incremental as it focuses on comparing existing and proposed measures rather than introducing a new paradigm.

The study compared various performance and bias measures for evaluating state-of-the-art end-to-end automatic speech recognition (ASR) systems for Dutch, finding that averaged error rates alone are insufficient and should be supplemented by other measures to better represent system performance and bias across diverse speaker groups.

There is growing evidence that automatic speech recognition (ASR) systems are biased against different speakers and speaker groups, e.g., due to gender, age, or accent. Research on bias in ASR has so far primarily focused on detecting and quantifying bias, and on developing mitigation approaches. Despite this progress, how best to measure the performance and bias of a system remains an open question. In this study, we compare different performance and bias measures, both from the literature and newly proposed, to evaluate state-of-the-art end-to-end ASR systems for Dutch. Our experiments use several bias mitigation strategies to address bias against different speaker groups. The findings reveal that averaged error rates, a standard in ASR research, are not sufficient on their own and should be supplemented by other measures. The paper ends with recommendations for reporting ASR performance and bias to better represent a system's performance for diverse speaker groups, as well as overall system bias.
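To illustrate why an averaged error rate can hide differences between speaker groups, here is a minimal Python sketch. It is not the paper's method: the abstract does not name the specific measures compared, so the sketch assumes plain word error rate (WER) per group and, as a stand-in bias measure, each group's WER gap to the best-performing group. The group labels, the sentences, and the helper names `word_errors`, `group_wer`, and `bias_gaps` are all hypothetical.

```python
from statistics import mean


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def group_wer(samples):
    """samples: iterable of (group, reference, hypothesis); returns {group: WER in %}."""
    totals = {}
    for group, ref, hyp in samples:
        errs, n_words = word_errors(ref, hyp)
        e, w = totals.get(group, (0, 0))
        totals[group] = (e + errs, w + n_words)
    return {g: 100.0 * e / w for g, (e, w) in totals.items()}


def bias_gaps(wers):
    """Assumed bias measure: each group's WER minus the lowest group WER."""
    best = min(wers.values())
    return {g: w - best for g, w in wers.items()}


# Hypothetical toy data: two speaker groups reading the same sentence.
samples = [
    ("group_a", "de kat zit op de mat", "de kat zit op de mat"),
    ("group_b", "de kat zit op de mat", "de kat zat op mat"),
]

wers = group_wer(samples)
gaps = bias_gaps(wers)
average_wer = mean(wers.values())

print("per-group WER:", wers)
print("average WER:", average_wer)
print("gap to best group:", gaps)
```

In this toy case one group is recognized without errors while the other has two errors in six words, so the single averaged figure describes neither group well. Reporting per-group rates and a gap-style measure alongside the average is one way to make that visible.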
