SD | AI | AS | Jul 30, 2025

Adaptive Duration Model for Text Speech Alignment

arXiv:2507.22612v2 | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses a key bottleneck in non-autoregressive TTS models by providing more precise and adaptable duration predictions, which is incremental but impactful for improving speech synthesis quality.

The paper tackles the problem of phoneme-level duration prediction for speech-to-text alignment in neural TTS models, proposing a novel framework that improves alignment accuracy and enhances zero-shot TTS robustness.

Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments online, while non-autoregressive end-to-end TTS models rely on durations extracted from external sources. In this paper, we propose a novel duration prediction framework that predicts a phoneme-level duration distribution from the given text. In our experiments, the proposed duration model predicts durations more precisely and adapts better to conditions than previous baseline models. Specifically, it considerably improves phoneme-level alignment accuracy and makes zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
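
To make the role of durations concrete: in a non-autoregressive pipeline, a sequence of per-phoneme frame counts fully determines a monotonic alignment between text and speech frames. The sketch below is only an illustration of that general idea, not the paper's model. The function names, the phoneme symbols, and the frame counts are all made up for the example.

```python
from itertools import accumulate


def durations_to_alignment(phonemes, durations):
    """Expand per-phoneme frame counts into a frame-level phoneme sequence.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative frame counts")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        # Each phoneme occupies n_frames consecutive frames.
        frames.extend([phoneme] * n_frames)
    return frames


def phoneme_end_frames(durations):
    """Return the exclusive end frame index of each phoneme."""
    return list(accumulate(durations))


# Made-up example: four phonemes with assumed frame counts.
phonemes = ["HH", "AH", "L", "OW"]
durations = [3, 5, 2, 6]

alignment = durations_to_alignment(phonemes, durations)
ends = phoneme_end_frames(durations)
```

Here `alignment` holds one phoneme label per frame and `ends` holds the phoneme boundaries, so an error in any predicted duration shifts every later boundary. That is one reason duration accuracy matters for alignment quality.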

