Speech Corpora Divergence Based Unsupervised Data Selection for ASR
This paper addresses training-data selection for ASR, offering an unsupervised approach that is an incremental advance over prior methods.
The paper tackles the problem of selecting training data for automatic speech recognition by proposing an unsupervised method based on speech corpora divergence, which measures similarity between corpora using self-supervised models and N-gram distributions. Experiments on Common Voice accents show it achieves a 14.8% relative improvement over random selection and performs comparably to supervised methods.
Selecting training data that matches the application scenario is important for automatic speech recognition (ASR) training, but it is difficult to measure how well a training corpus matches. This study proposes an unsupervised target-aware data selection method based on speech corpora divergence (SCD), which measures the similarity between two speech corpora. We first use the self-supervised HuBERT model to discretize each speech corpus into label sequences and calculate its N-gram probability distribution. We then calculate the Kullback-Leibler divergence between the N-gram distributions as the SCD. Finally, we choose the subset with the minimum SCD to the target corpus for annotation and training. Compared to previous data selection methods, SCD data selection can focus on more acoustic details and guarantee the diversity of the selected set. We evaluate our method on different accents from Common Voice. Experiments show that the proposed SCD data selection achieves a 14.8% relative improvement over random selection, which is comparable or even superior to the result of supervised selection.
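The pipeline described above (discrete label sequences, N-gram distributions, KL divergence, minimum-SCD subset) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the speech has already been discretized into integer label sequences (for example by clustering HuBERT features), and the function names, the additive smoothing constant `alpha`, the direction of the divergence (target relative to candidate), and the choice among pre-built candidate subsets are all illustrative assumptions that the abstract does not specify.

```python
from collections import Counter
import math


def ngram_counts(sequences, n):
    """Count n-grams over a corpus given as a list of label sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def scd(target_seqs, candidate_seqs, n=2, alpha=1e-3):
    """KL(P_target || P_candidate) between smoothed n-gram distributions.

    Assumption: additive smoothing with constant `alpha` keeps the
    divergence finite when an n-gram is missing from one corpus.
    """
    p_counts = ngram_counts(target_seqs, n)
    q_counts = ngram_counts(candidate_seqs, n)
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (p_counts[gram] + alpha) / p_total
        q = (q_counts[gram] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def select_subset(target_seqs, candidate_subsets, n=2):
    """Return the name of the candidate subset with minimum SCD to the target.

    `candidate_subsets` is a hypothetical dict mapping a subset name to its
    list of label sequences.
    """
    return min(
        candidate_subsets,
        key=lambda name: scd(target_seqs, candidate_subsets[name], n),
    )
```

In this sketch, a subset whose label N-grams resemble those of the target corpus yields a smaller divergence and is therefore preferred for annotation. How the candidate subsets are actually constructed or searched is left open here.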