AS SD | Oct 10, 2021

DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation

Suraj Kothawade, Anmol Mekala, Chandra Sekhara D, Mayank Kothyari, Rishabh Iyer, Ganesh Ramakrishnan, Preethi Jyothi

arXiv:2110.04908v436.9223 citations

Originality: Incremental advance

AI Analysis

This paper addresses data-efficient and fair accent adaptation in ASR systems, an incremental advance for speech recognition applications.

The paper tackles the problem of improving ASR performance on specific accents when labeled data is limited. It proposes DITTO, a method for selecting informative subsets of speech samples for finetuning, and reports that DITTO is 3-5 times more label-efficient than other speech selection methods on the IndicTTS and L2 datasets.

State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to finetune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn) that uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e. it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that DITTO is 3-5 times more label-efficient than other speech selection methods on the IndicTTS and L2 datasets.
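The selection step described in the abstract can be illustrated with a minimal sketch. It greedily maximizes a facility-location-style mutual information function between a candidate subset and a small query set of target-accent utterances, under a fixed budget. Everything specific here is an assumption made for illustration: the abstract does not say which SMI functions DITTO uses, so the objective `flqmi`, the trade-off parameter `eta`, the cosine similarity, and the toy 3-dimensional "accent features" are hypothetical stand-ins, not the authors' implementation.

```python
from math import sqrt
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(pool, query) -> List[List[float]]:
    """sim[i][q] = similarity between pool utterance i and query utterance q."""
    return [[cosine(p, q) for q in query] for p in pool]


def flqmi(selected: Sequence[int], sim: List[List[float]], eta: float = 1.0) -> float:
    """Facility-location-style mutual information (illustrative choice).

    First term: how well the selected set covers every query utterance.
    Second term: how relevant each selected utterance is to the query set.
    """
    if not selected:
        return 0.0
    n_query = len(sim[0])
    coverage = sum(max(sim[i][q] for i in selected) for q in range(n_query))
    relevance = sum(max(sim[i]) for i in selected)
    return coverage + eta * relevance


def greedy_select(sim: List[List[float]], budget: int, eta: float = 1.0) -> List[int]:
    """Pick up to `budget` pool indices, each time adding the item that
    yields the highest objective value (ties go to the lowest index)."""
    selected: List[int] = []
    remaining = list(range(len(sim)))
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda i: flqmi(selected + [i], sim, eta))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy features (hypothetical): axis 0 = accent A, axis 1 = accent B,
# axis 2 = an off-target accent C.
POOL = [
    (1.0, 0.0, 0.0),  # 0: accent A
    (0.9, 0.1, 0.0),  # 1: mostly accent A
    (0.0, 1.0, 0.0),  # 2: accent B
    (0.0, 0.0, 1.0),  # 3: accent C (off-target)
    (0.1, 0.0, 0.9),  # 4: mostly accent C
]
QUERY = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # one target utterance per accent

sim = similarity_matrix(POOL, QUERY)
picked = greedy_select(sim, budget=2)
print(picked)
```

In this toy pool, a budget of two yields one utterance near each target accent rather than two near-duplicates from accent A. The coverage term produces this behaviour, since a second accent-A utterance adds little once accent A is already covered. This is one way to read the "fair targeting" property the abstract describes. A real pipeline would replace the toy vectors with learned speech features.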
