SD, AI, CL, LG, AS · Dec 11, 2024

Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration

arXiv:2412.08112v1 · 1 citation · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses inefficiencies in TTS systems for generating more natural speech, though it appears incremental as it builds on existing models like FastSpeech.

The paper tackles the problem of text-to-speech models relying on external tools for duration labeling, which can be inefficient and inaccurate, by proposing an aligner-guided training paradigm that trains an aligner first to improve duration accuracy. The result is up to a 16% improvement in word error rate and enhanced phoneme and tone alignment.
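The summary does not say how word error rate was measured. As a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), the function below may help in reading the reported figure. The function name and the sample sentences are hypothetical and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Whether the reported 16% is a relative or an absolute reduction is not stated here, so the sketch computes only the raw metric.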

Recent advancements in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved speech generation quality. However, these models often rely on durations generated by external tools like the Montreal Forced Aligner, which can be time-consuming and lack flexibility. The importance of accurate durations is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and enhances alignment accuracy. We further explore the impact of different acoustic features, including Mel-Spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment. These findings highlight the effectiveness of our approach in optimizing TTS systems for more natural and intelligible speech generation.
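In FastSpeech-style models, a duration label is commonly the number of acoustic frames assigned to each phoneme. The abstract does not describe the aligner's output format, so the sketch below assumes, purely for illustration, that an aligner yields one phoneme index per frame in monotonic order. The function name and the sample alignment are hypothetical, and this is not presented as the authors' implementation.

```python
def durations_from_alignment(frame_to_phoneme, num_phonemes):
    """Count how many acoustic frames are assigned to each phoneme index.

    Assumes (hypothetically) a monotonic frame-level alignment, i.e. the
    phoneme index never decreases from one frame to the next.
    """
    durations = [0] * num_phonemes
    previous = 0
    for index in frame_to_phoneme:
        if not 0 <= index < num_phonemes:
            raise ValueError(f"phoneme index {index} is out of range")
        if index < previous:
            raise ValueError("alignment must be monotonic")
        durations[index] += 1
        previous = index
    return durations


# Hypothetical alignment: 8 frames spread over 4 phonemes.
alignment = [0, 0, 1, 1, 1, 2, 3, 3]
durations = durations_from_alignment(alignment, num_phonemes=4)
# The durations sum to the number of frames, which is the property a
# length regulator relies on when expanding phonemes to frame rate.
```

Counting frames keeps the labels tied to whichever acoustic feature the aligner was trained on, since the frame rate of that feature determines the unit of each duration.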

