LGCLOct 19, 2021

AequeVox: Automated Fairness Testing of Speech Recognition Systems

arXiv:2110.09843v218 citations
Originality Incremental advance
AI Analysis

This work addresses fairness issues in ASR systems, which are critical for equitable technology use in daily life, by providing an automated testing method, though it is incremental as it builds on existing fairness testing approaches.

The paper tackled the problem of fairness in automatic speech recognition (ASR) systems by introducing AequeVox, an automated testing framework that simulates different environments to evaluate fairness across subgroups, revealing that non-native English, female, and Nigerian English speakers generated 109%, 528.5%, and 156.9% more errors, respectively, compared to native English, male, and UK Midlands speakers.

Automatic Speech Recognition (ASR) systems have become ubiquitous. They can be found in a variety of form factors and are increasingly important in our daily lives. As such, ensuring that these systems are equitable to different subgroups of the population is crucial. In this paper, we introduce, AequeVox, an automated testing framework for evaluating the fairness of ASR systems. AequeVox simulates different environments to assess the effectiveness of ASR systems for different populations. In addition, we investigate whether the chosen simulations are comprehensible to humans. We further propose a fault localization technique capable of identifying words that are not robust to these varying environments. Both components of AequeVox are able to operate in the absence of ground truth data. We evaluated AequeVox on speech from four different datasets using three different commercial ASRs. Our experiments reveal that non-native English, female and Nigerian English speakers generate 109%, 528.5% and 156.9% more errors, on average than native English, male and UK Midlands speakers, respectively. Our user study also reveals that 82.9% of the simulations (employed through speech transformations) had a comprehensibility rating above seven (out of ten), with the lowest rating being 6.78. This further validates the fairness violations discovered by AequeVox. Finally, we show that the non-robust words, as predicted by the fault localization technique embodied in AequeVox, show 223.8% more errors than the predicted robust words across all ASRs.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes