Prosodic Boundary-Aware Streaming Generation for LLM-Based TTS with Streaming Text Input
This work provides a robust solution for streaming TTS with incremental text input, benefiting interactive systems that require real-time speech generation.
This paper addresses two challenges of streaming Text-to-Speech (TTS) for interactive systems: unnatural prosody and long-form collapse. The authors propose a prosodic-boundary-aware post-training strategy for LLM-based TTS. In long-text synthesis, the method reduces word error rate by 66.2 percentage points (from 71.0% to 4.8%) and improves speaker and emotion similarity by 16.1% and 1.5% relative, respectively.
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy that adapts a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model learns to stop early at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show that our method outperforms a CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis in particular, it achieves a 66.2% absolute reduction in word error rate (from 71.0% to 4.8%) and relative gains of 16.1% and 1.5% in speaker and emotion similarity, respectively, offering a robust solution for streaming TTS with incremental text.
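The sliding-window inference loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `synth` callable, the chunk-level `window` and `lookahead` parameters, and the toy token values are all assumptions made for illustration. The sketch only shows how a bounded prompt of previous text and speech tokens might be carried forward while each chunk is generated with limited future text.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

# Hypothetical model interface (an assumption, not the paper's API):
# synth(prompt_text, prompt_speech, current_text, future_text) returns the
# speech tokens for current_text only, stopping at its content boundary.
SynthFn = Callable[[str, List[int], str, str], List[int]]


def stream_tts(
    chunks: Iterable[str],
    synth: SynthFn,
    window: int = 2,      # assumed: number of past chunks kept in the prompt
    lookahead: int = 1,   # assumed: number of future chunks shown to the model
) -> Iterator[List[int]]:
    """Yield speech tokens chunk by chunk with a bounded prompt."""
    # deque(maxlen=...) drops the oldest (text, speech) pair automatically,
    # so the prompt never grows beyond `window` chunks.
    history: Deque[Tuple[str, List[int]]] = deque(maxlen=window)
    pending: Deque[str] = deque()

    def emit(final: bool) -> Iterator[List[int]]:
        # Synthesize a chunk once enough lookahead has arrived,
        # or unconditionally once the text stream has ended.
        while pending and (final or len(pending) > lookahead):
            current = pending.popleft()
            future = " ".join(list(pending)[:lookahead])
            prompt_text = " ".join(text for text, _ in history)
            prompt_speech = [tok for _, speech in history for tok in speech]
            speech = synth(prompt_text, prompt_speech, current, future)
            history.append((current, speech))
            yield speech

    for chunk in chunks:
        pending.append(chunk)
        yield from emit(final=False)
    yield from emit(final=True)


# Toy stand-in for the TTS model: one "speech token" per word (its length).
calls = []

def toy_synth(prompt_text, prompt_speech, current, future):
    calls.append((prompt_text, len(prompt_speech), current, future))
    return [len(word) for word in current.split()]


out = list(stream_tts(
    ["Hello there,", "how are you", "doing today?", "Fine."],
    toy_synth,
))
```

With `window=2`, the prompt for the last chunk contains only the two preceding chunks, so the context stays bounded regardless of how long the input stream is. The concatenation of the yielded token lists stands in for the seamless concatenation of generated speech segments.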