SD · AI · CL · LG · Oct 9, 2021

PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control

arXiv:2110.04486v2
AI Analysis

This paper addresses the problem of generating stable, natural-sounding speech with accurate phoneme durations in TTS applications. It represents an incremental improvement over existing attention-based and duration-informed methods.

The paper tackles unstable attention and poor duration control in sequence-to-sequence text-to-speech (TTS) by proposing PAMA-TTS, which combines monotonic attention with an explicit duration model through token duration and countdown information. In the authors' experiments, PAMA-TTS achieves the highest naturalness, with duration controllability on par with or better than that of a duration-informed model.

Sequence expansion between encoder and decoder is a critical challenge in sequence-to-sequence TTS. Attention-based methods achieve great naturalness but suffer from stability issues such as missing and repeated phonemes, and they offer no accurate duration control. Duration-informed methods, by contrast, make it easy to adjust phoneme duration but show an obvious degradation in speech naturalness. This paper proposes PAMA-TTS to address the problem. It takes advantage of both flexible attention and explicit duration models. Building on the monotonic attention mechanism, PAMA-TTS also leverages token duration and the relative position of a frame within its token, especially countdown information, i.e. in how many future frames the present phoneme will end. These cues help the attention move forward along the token sequence under soft but reliable control. Experimental results show that PAMA-TTS achieves the highest naturalness while offering on-par or even better duration controllability than the duration-informed model.
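The countdown idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `progression_features`, the `FrameProgress` record, and the convention that a countdown of 0 marks the last frame of a phoneme are all assumptions made here for illustration. Given per-phoneme durations in frames, the sketch lists, for every output frame, which phoneme it belongs to, how far into that phoneme it is, and how many frames remain before the phoneme ends.

```python
from typing import List, NamedTuple


class FrameProgress(NamedTuple):
    """Hypothetical per-frame progression record (illustrative only)."""
    token_index: int  # index of the phoneme this frame belongs to
    elapsed: int      # frames already spent in this phoneme (0-based)
    countdown: int    # frames left after this one; 0 means last frame
    duration: int     # total frames assigned to the phoneme


def progression_features(durations: List[int]) -> List[FrameProgress]:
    """Expand per-phoneme durations into per-frame progression features.

    Assumption: durations are non-negative integer frame counts, e.g.
    predicted by a duration model or taken from forced alignment.
    """
    frames: List[FrameProgress] = []
    for token_index, duration in enumerate(durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        for elapsed in range(duration):
            countdown = duration - 1 - elapsed
            frames.append(
                FrameProgress(token_index, elapsed, countdown, duration)
            )
    return frames


if __name__ == "__main__":
    # Three phonemes lasting 3, 1 and 2 frames respectively.
    for frame in progression_features([3, 1, 2]):
        print(frame)
```

In this sketch, scaling an entry of `durations` lengthens or shortens the corresponding phoneme and shifts every countdown for that phoneme accordingly. How PAMA-TTS actually feeds such signals into its monotonic attention is described in the paper itself and is not reproduced here.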

