Training Neural Speech Recognition Systems with Synthetic Speech Augmentation
The work addresses data scarcity in ASR research and enables more accurate systems, though the contribution is incremental because it builds on existing augmentation techniques.
The paper tackles the problem of limited labeled speech data for automatic speech recognition by augmenting the LibriSpeech dataset with synthetic speech, resulting in state-of-the-art Word Error Rate for character-level models without an external language model.
Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such free, open datasets is one of the main issues preventing advances in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-to-end neural speech recognition models using the LibriSpeech dataset augmented with synthetic speech. These new models achieve a state-of-the-art Word Error Rate (WER) for character-level models without an external language model.
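Since the results are reported as Word Error Rate, a minimal sketch of the standard metric may help: WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the evaluation code actually used for the reported numbers is not described here and may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.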