SD AI AS
Jun 25, 2024

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

arXiv:2406.17957v1 | 20.430 citations

Originality: Incremental advance

AI Analysis

The paper addresses robustness issues in LLM-based TTS, but the advance is incremental because it builds on existing encoder-decoder transformer models.

The paper tackles hallucinations and attention errors in LLM-based text-to-speech systems, which are most pronounced when the text repeats tokens. It proposes using a CTC loss and attention priors to enforce monotonic alignment, which significantly improves robustness without adding parameters.

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.
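To make the idea of an attention prior concrete, here is a minimal sketch of how a fixed, parameter-free prior can bias cross-attention toward a monotonic path over the text tokens. The abstract does not say which prior the authors use, so the Gaussian band around the diagonal, the function names `diagonal_prior` and `apply_prior`, and the `width` parameter are all assumptions made for illustration. The CTC loss term is not shown.

```python
import math

def diagonal_prior(num_speech_steps, num_text_tokens, width=1.0):
    """Return a row-normalized prior, indexed [speech step][text token].

    Assumption for illustration: each speech step prefers the text position
    on the diagonal, with a Gaussian falloff of standard deviation `width`.
    """
    prior = []
    for t in range(num_speech_steps):
        center = t * (num_text_tokens - 1) / max(num_speech_steps - 1, 1)
        weights = [math.exp(-0.5 * ((n - center) / width) ** 2)
                   for n in range(num_text_tokens)]
        total = sum(weights)
        prior.append([w / total for w in weights])
    return prior

def apply_prior(attn_logits, prior, eps=1e-8):
    """Add the log-prior to raw cross-attention scores, then softmax each row."""
    out = []
    for logit_row, prior_row in zip(attn_logits, prior):
        scores = [l + math.log(p + eps) for l, p in zip(logit_row, prior_row)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Flat logits stand in for the ambiguous case the abstract describes:
# content-based scores alone cannot separate repeated tokens.
T, N = 8, 4
flat_logits = [[0.0] * N for _ in range(T)]
attention = apply_prior(flat_logits, diagonal_prior(T, N))
peaks = [row.index(max(row)) for row in attention]
print(peaks)  # most-attended text position for each speech step
```

With no content signal to rely on, the prior alone decides where each speech step attends, and the most-attended text position never moves backward. Because the prior is computed rather than learned, it adds no parameters, which is consistent with the claim in the abstract.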
