CL, LG, SD, AS | Oct 29, 2024

Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

arXiv:2410.22179v2 | 12 citations | h-index: 18 | NAACL
Originality Highly original
AI Analysis

This paper addresses robustness and length generalization in text-to-speech for applications that require long or variable-length utterances. It is a strong, targeted improvement rather than a foundational change.

The paper tackles the tendency of autoregressive Transformer-based text-to-speech models to drop or repeat words on long sequences. It does so by introducing an alignment mechanism that supplies relative location information to the cross-attention operations. The resulting system, Very Attentive Tacotron, matches baseline naturalness while eliminating these errors and generalizing to any practical utterance length.

Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
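
The idea of giving cross-attention relative location information can be illustrated with a minimal sketch. This is not the paper's formulation: the function names, the linear distance penalty, the `slope` parameter, and the softplus step are all assumptions chosen for illustration (the paper learns the alignment position end-to-end and does not specify these details in the abstract). The sketch only shows how a bias that depends on the distance between each input position and a running alignment position keeps attention near that position, and how a non-negative step keeps the position monotonic.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def location_relative_attention(query, keys, values, align_pos, slope=1.0):
    """Single-head cross-attention with a distance-based bias (illustrative).

    Each score is the scaled dot product of the query with a key, minus
    slope * |j - align_pos|, so input positions far from the current
    alignment position are down-weighted. The linear penalty is a
    hypothetical stand-in for a learned relative-position bias.
    """
    d = len(query)
    scores = []
    for j, key in enumerate(keys):
        content = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        bias = -slope * abs(j - align_pos)
        scores.append(content + bias)
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


def advance(align_pos, raw_delta):
    """Move the alignment position forward by softplus(raw_delta) > 0.

    Using a strictly positive step is an assumption that enforces the
    monotonic input-output alignment typical of TTS.
    """
    return align_pos + math.log1p(math.exp(raw_delta))
```

With identical keys, the content term is the same at every input position, so the attention weights are determined entirely by the distance bias and peak at the input position closest to `align_pos`. Because the bias depends only on relative distance, the same computation applies unchanged to inputs longer than any seen during training.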

Code Implementations: 1 repo