AS · CL · SD · May 30, 2023

Towards Selection of Text-to-speech Data to Augment ASR Training

arXiv:2306.00998v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses data efficiency in ASR training, presenting an incremental improvement over baseline methods for selecting synthetic training data.

The paper tackles the problem of selecting synthetic speech data from a large TTS dataset to augment ASR training. It finds that incorporating synthetic samples dissimilar to real speech improves recognition performance, and that the TTS data can be reduced to below 30% of its original size while maintaining the accuracy obtained with all of the data.

This paper presents a method for selecting appropriate synthetic speech samples from a large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or Arcface loss, to measure the similarity of synthetic data to real speech. We found that incorporating synthetic samples with considerable dissimilarity to real speech, owing in part to lexical differences, into ASR training is crucial for boosting recognition performance. Experimental results on Librispeech test sets indicate that our proposed solution can reduce the TTS data to below $30\,\%$ of its original size while maintaining the same speech recognition accuracy as when using all TTS data, which is superior to several baseline methods.
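
As a rough illustration of the selection step the abstract describes, the sketch below assumes that each synthetic utterance already has a score from a trained real-versus-synthetic network, where a low score means the utterance sounds unlike real speech. The function name `select_tts_subset`, the `realness_scores` input, and the simple rank-and-cut rule are all hypothetical; the abstract does not specify the paper's actual selection rule, so this is not the authors' method.

```python
from typing import List, Sequence


def select_tts_subset(
    realness_scores: Sequence[float],
    keep_fraction: float = 0.3,
    prefer_dissimilar: bool = True,
) -> List[int]:
    """Return indices of the synthetic utterances to keep.

    realness_scores: one score per synthetic utterance (assumed to come
        from a separately trained similarity network; higher = more like
        real speech).
    keep_fraction: share of the TTS data to retain, e.g. 0.3 for 30%.
    prefer_dissimilar: if True, keep the lowest-scoring (least real-like)
        utterances; if False, keep the highest-scoring ones.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    n_total = len(realness_scores)
    if n_total == 0:
        return []

    # Keep at least one utterance, and never more than are available.
    n_keep = min(n_total, max(1, round(n_total * keep_fraction)))

    # Rank utterance indices by score: ascending when dissimilar samples
    # are preferred, descending otherwise.
    ranked = sorted(
        range(n_total),
        key=lambda i: realness_scores[i],
        reverse=not prefer_dissimilar,
    )
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.2, 0.8, 0.7, 0.3, 0.6, 0.4, 0.05]
    print(select_tts_subset(scores, keep_fraction=0.3))
```

Setting `prefer_dissimilar=False` gives the opposite policy (keeping only the most real-like utterances), which makes it easy to compare the two choices on a held-out set.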

