SD · AI · AS · Jun 25, 2024

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

arXiv:2406.17957v1 · 30 citations
Originality: Incremental advance
AI Analysis

This work addresses robustness issues in LLM-based TTS that affect its users, but it is an incremental advance because it builds on existing encoder-decoder transformer models.

The paper tackles hallucinations and attention errors in LLM-based text-to-speech systems, which are most pronounced when the text contains repeated tokens. It proposes techniques using CTC loss and attention priors to encourage monotonic alignment, which significantly improves robustness without adding parameters.

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust, as the generated output can contain repeating words, missing words and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.
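
As a rough illustration of the attention-prior idea mentioned in the abstract, the sketch below biases a cross-attention matrix (speech steps by text tokens) toward the diagonal and then checks whether the attention peaks move monotonically through the text. The Gaussian-shaped diagonal prior, the `width` parameter, and all function names are assumptions made for illustration; the abstract does not specify the form of the prior, and the CTC loss term is not shown.

```python
import numpy as np


def diagonal_attention_prior(num_speech_steps, num_text_tokens, width=0.2):
    """Hypothetical prior: each speech step prefers text positions near the diagonal."""
    # Normalised positions in [0, 1] along the speech and text axes.
    t = np.linspace(0.0, 1.0, num_speech_steps)[:, None]  # shape (T, 1)
    n = np.linspace(0.0, 1.0, num_text_tokens)[None, :]   # shape (1, N)
    prior = np.exp(-((t - n) ** 2) / (2.0 * width ** 2))
    # Each row becomes a probability distribution over text tokens.
    return prior / prior.sum(axis=1, keepdims=True)


def apply_prior(attention_logits, prior, eps=1e-8):
    """Add the log-prior to the attention logits, then apply a row-wise softmax."""
    biased = attention_logits + np.log(prior + eps)
    biased = biased - biased.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(biased)
    return weights / weights.sum(axis=1, keepdims=True)


def is_monotonic(attention):
    """True if the most-attended text position never moves backwards over time."""
    peaks = attention.argmax(axis=1)
    return bool(np.all(np.diff(peaks) >= 0))


# Usage: with uninformative (all-zero) logits, the biased attention follows the prior.
prior = diagonal_attention_prior(num_speech_steps=20, num_text_tokens=5)
attn = apply_prior(np.zeros((20, 5)), prior)
print(is_monotonic(attn))
```

Because the prior is only added to the logits, this sketch introduces no learnable parameters, which matches the abstract's description of the guided attention technique at a high level.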

