Adaptive Duration Model for Text Speech Alignment
This work addresses a key bottleneck in non-autoregressive TTS models by providing more precise and adaptable duration predictions, an incremental but impactful step toward better speech synthesis quality.
The paper tackles the problem of phoneme-level duration prediction for speech-to-text alignment in neural TTS models, proposing a novel framework that improves alignment accuracy and enhances zero-shot TTS robustness.
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments online, while non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that predicts a phoneme-level duration distribution for a given text. In our experiments, the proposed duration model predicts durations more precisely and adapts better to conditions than previous baseline models. Specifically, it considerably improves phoneme-level alignment accuracy and makes the performance of zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
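
To illustrate why phoneme-level durations matter for non-autoregressive TTS, the sketch below shows how a list of per-phoneme durations (in frames) fixes a monotonic text-to-speech alignment. This is a generic illustration, not the proposed duration model: the function names `expand_by_duration` and `phoneme_boundaries`, the example phonemes, and the duration values are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its duration (in frames) to get a frame-level sequence."""
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        frames.extend([phoneme] * duration)
    return frames


def phoneme_boundaries(durations: Sequence[int]) -> List[Tuple[int, int]]:
    """Return the (start, end) frame span of each phoneme; end is exclusive."""
    boundaries: List[Tuple[int, int]] = []
    start = 0
    for duration in durations:
        boundaries.append((start, start + duration))
        start += duration
    return boundaries


if __name__ == "__main__":
    # Hypothetical phonemes and durations, chosen only for illustration.
    phonemes = ["h", "ə", "l"]
    durations = [2, 1, 3]
    print(expand_by_duration(phonemes, durations))
    print(phoneme_boundaries(durations))
```

Because the frame spans follow directly from the durations, any error in a predicted duration shifts every later phoneme boundary, which is one reason duration accuracy affects alignment quality.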