On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition
This work addresses domain mismatch and low-resource issues in ASR by refining synthetic data generation. The contribution is incremental, since it builds on existing TTS and alignment methods.
The study examines synthetic data quality for automatic speech recognition by analyzing how phoneme duration variability in text-to-speech output affects ASR training. It finds that shifting the TTS duration distributions closer to real durations improves ASR performance in a semi-supervised setting.
Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain-mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work, we focus on the temporal structure of synthetic data and its relation to ASR training. Using a novel oracle setup, we show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive (NAR) TTS. To obtain reference phoneme durations, we use two common alignment methods: a hidden Markov Gaussian-mixture model (HMM-GMM) aligner and a neural connectionist temporal classification (CTC) aligner. Using a simple algorithm based on random walks, we shift the phoneme duration distributions of the TTS system closer to real durations, which improves an ASR system trained with synthetic data in a semi-supervised setting.
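
The abstract does not spell out the random-walk algorithm, so the following is only a minimal sketch of one way such a perturbation could look, not the authors' method. It assumes durations are integer frame counts per phoneme, and the function name `random_walk_durations` and the parameters `max_offset` and `min_duration` are hypothetical. A bounded offset drifts by -1, 0, or +1 from one phoneme to the next and is added to each predicted duration, so neighboring phonemes are lengthened or shortened together instead of independently.

```python
import random


def random_walk_durations(durations, max_offset=2, min_duration=1, seed=None):
    """Perturb per-phoneme durations (in frames) with a bounded random walk.

    Illustrative assumption only: the walk offset changes by -1, 0, or +1
    per phoneme, is clipped to [-max_offset, max_offset], and is added to
    the predicted duration. No phoneme is shortened below min_duration.
    """
    rng = random.Random(seed)
    offset = 0
    perturbed = []
    for d in durations:
        # Take one random-walk step and keep the offset within bounds.
        offset += rng.choice((-1, 0, 1))
        offset = max(-max_offset, min(max_offset, offset))
        # Apply the offset while keeping every phoneme at least min_duration long.
        perturbed.append(max(min_duration, d + offset))
    return perturbed


if __name__ == "__main__":
    predicted = [3, 5, 2, 8, 4, 1, 6]  # hypothetical TTS duration predictions
    print(random_walk_durations(predicted, max_offset=2, seed=0))
```

In this sketch, `max_offset` controls how far the perturbed durations can move from the TTS predictions. In practice it would have to be tuned so that the resulting duration distribution moves toward the reference durations from the aligner.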