Considerations for Ethical Speech Recognition Datasets
This work tackles ethical issues in speech AI for marginalized users, but it is incremental: it builds on existing discussions of dataset bias without introducing new methods.
The paper addresses the problem of speech recognition models performing poorly on users from non-dominant demographics due to biased datasets, advocating for ethical dataset design to improve robustness and inclusivity. It proposes considerations like legal protection, targeted sampling, and sociotechnical design to enhance model utility and user respect.
Speech AI technologies are largely trained on publicly available datasets or on speech obtained through massive web crawling. In both cases, data acquisition focuses on minimizing collection effort, without necessarily taking the data subjects' protection or user needs into consideration. This results in models that are not robust when used by users who deviate from the dominant demographics in the training set, and that discriminate against individuals with different dialects, accents, speaking styles, and disfluencies. In this talk, we use automatic speech recognition as a case study and examine the properties that ethical speech datasets should possess to support responsible AI applications. We showcase diversity issues, inclusion practices, and necessary considerations that can improve trained models while facilitating model explainability and protecting users and data subjects. We argue for the legal and privacy protection of data subjects, targeted data sampling corresponding to user demographics and needs, appropriate metadata that ensures explainability and accountability in cases of model failure, and sociotechnical and situated model design. We hope this talk can inspire researchers and practitioners to design and use more human-centric datasets in speech technologies and other domains, in ways that empower and respect users while improving the robustness and utility of machine learning models.