CL SD AS
Jun 4, 2025

Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR

Zheng-Xin Yong, Vineel Pratap, Michael Auli, Jean Maillard

arXiv:2506.04364v14.91 citations · h-index: 13 · INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses the problem of building robust ASR systems for diverse accents in low-resource settings, providing practical guidance for data composition, though it is incremental in nature.

The study investigated how speaker count, duration per speaker, and accent diversity in training data affect zero-shot accent robustness in low-resource ASR, finding that increasing speaker count is more beneficial than longer per-speaker audio and that accent diversity offers minimal gains when speaker count is controlled.

To build an automatic speech recognition (ASR) system that can serve everyone in the world, the ASR system needs to be robust to a wide range of accents, including unseen accents. We systematically study how three variables in training data -- the number of speakers, the audio duration per speaker, and the diversity of accents -- affect ASR robustness towards unseen accents in a low-resource training regime. We observe that for a fixed number of ASR training hours, it is more beneficial to increase the number of speakers (which means each speaker contributes less) than the number of hours contributed per speaker. We also observe that having more speakers enables ASR performance gains from scaling the number of hours. Surprisingly, we observe minimal benefits to prioritizing speakers with different accents when the number of speakers is controlled. Our work suggests that practitioners should prioritize increasing the speaker count in ASR training data composition for new languages.
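
The trade-off described in the abstract (a fixed number of training hours spread over more or fewer speakers) can be illustrated with a minimal sketch. This is not the authors' sampling procedure; the function name `sample_fixed_budget`, the `(speaker_id, duration_seconds)` input format, and the even per-speaker split are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_fixed_budget(utterances, n_speakers, total_hours, seed=0):
    """Pick utterance indices from `n_speakers` speakers under a fixed hour budget.

    `utterances` is a list of (speaker_id, duration_seconds) tuples
    (a hypothetical format). The budget is split evenly across speakers,
    so more speakers means less audio per speaker.
    """
    rng = random.Random(seed)

    # Group utterance indices by speaker.
    by_speaker = defaultdict(list)
    for idx, (spk, dur) in enumerate(utterances):
        by_speaker[spk].append((idx, dur))

    if n_speakers > len(by_speaker):
        raise ValueError("not enough distinct speakers in the pool")

    # Choose the speakers, then give each an equal share of the budget.
    chosen = rng.sample(sorted(by_speaker), n_speakers)
    per_speaker_budget = total_hours * 3600 / n_speakers

    selected = []
    for spk in chosen:
        items = by_speaker[spk][:]
        rng.shuffle(items)
        used = 0.0
        for idx, dur in items:
            # Skip any utterance that would exceed this speaker's share.
            if used + dur > per_speaker_budget:
                continue
            selected.append(idx)
            used += dur
    return selected
```

Calling the function twice with the same `total_hours` but different `n_speakers` yields two training subsets of roughly equal duration that differ only in speaker count, which is the kind of controlled comparison the abstract describes.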
