CL · LG · Apr 12, 2021

Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

arXiv:2104.05379v3 · 13 citations
Originality: Incremental advance
AI Analysis

This work addresses over-fitting in low-resource ASR, but it is incremental: it extends known synthetic-data methods to a comparison across ASR architectures.

The paper tackled over-fitting in attention encoder-decoder (AED) ASR architectures in low-resource scenarios by investigating synthetic data generated with TTS systems, achieving up to 38% relative improvement with internal language model subtraction. It also showed that hybrid systems outperform AED systems on LibriSpeech-100h, with a final word-error-rate of 3.3%/10.0% on the clean/noisy test sets.

Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures, which tend to suffer from over-fitting in low-resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech system (TTS) if additional text is available. This was successfully applied in many publications with AED systems, but only to a very limited extent in the context of other ASR architectures. We investigate the effect of varying pre-processing, the speaker embedding and input encoding of the TTS system w.r.t. the effectiveness of the synthesized data for AED-ASR training. Additionally, we also consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone based system using connectionist-temporal-classification (CTC) and a monotonic transducer based system. We show that for the latter systems the addition of synthetic data has no relevant effect, but they still outperform the AED systems on LibriSpeech-100h. We achieve a final word-error-rate of 3.3%/10.0% with a hybrid system on the clean/noisy test-sets, surpassing any previous state-of-the-art systems on LibriSpeech-100h that do not include unlabeled audio data.
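
The abstract mentions internal language model (ILM) subtraction and reports gains as relative improvements. As a rough illustration only, the sketch below shows one common way such a score combination is written: the AED log-probability plus a scaled external LM log-probability, minus a scaled estimate of the model's internal LM log-probability. The function names, the scale values and the hypothesis scores are all hypothetical and are not taken from the paper; the paper's actual ILM estimation and decoding setup may differ.

```python
def ilm_corrected_score(log_p_aed, log_p_ext_lm, log_p_ilm,
                        lm_scale=0.5, ilm_scale=0.3):
    """Combine log-probabilities for one hypothesis.

    Assumed form (not confirmed by the source):
        score = log p_AED + lm_scale * log p_LM - ilm_scale * log p_ILM
    The scales are illustrative placeholders, normally tuned on a dev set.
    """
    return log_p_aed + lm_scale * log_p_ext_lm - ilm_scale * log_p_ilm


def pick_best(hypotheses, lm_scale=0.5, ilm_scale=0.3):
    """Return the hypothesis with the highest ILM-corrected score."""
    return max(
        hypotheses,
        key=lambda h: ilm_corrected_score(
            h["aed"], h["lm"], h["ilm"], lm_scale, ilm_scale
        ),
    )


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent, e.g. how a '38% relative' figure is computed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Made-up scores for two competing hypotheses (natural-log probabilities).
hyps = [
    {"text": "hypothesis a", "aed": -4.0, "lm": -6.0, "ilm": -2.0},
    {"text": "hypothesis b", "aed": -4.5, "lm": -3.0, "ilm": -5.0},
]
best = pick_best(hyps)
```

With these made-up numbers, the second hypothesis wins after correction even though its raw AED score is lower, because the external LM favours it and the internal LM estimate is subtracted rather than counted twice.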
