ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading
This work addresses natural and efficient long-form speech synthesis for applications such as audiobooks and reading assistants, an incremental advance over existing sentence-level TTS methods.
The paper targets high-quality, expressive paragraph reading in Text-to-Speech systems. It addresses two obstacles, missing cross-sentence context and the computational cost of long inputs, and reports significantly better voice quality and prosody expressiveness at competitive model efficiency.
While state-of-the-art Text-to-Speech systems can generate natural, very high-quality speech at the sentence level, they still face great challenges in paragraph and long-form reading. These deficiencies stem from i) neglect of cross-sentence contextual information, and ii) the high computation and memory cost of long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
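To illustrate why linearized self-attention helps with long inputs, here is a minimal sketch of one common kernelized formulation. The abstract does not specify the variant used in ContextSpeech, so the feature map elu(x) + 1, the non-causal setting, and all function names below are assumptions made for illustration only. The sketch shows that summing the key-value products once, and reusing that summary for every query, gives the same result as building the full n-by-n weight matrix.

```python
import math


def feature_map(x):
    """elu(x) + 1: a strictly positive feature map (an assumed choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values):
    """Kernelized attention in time linear in the number of tokens n.

    queries, keys: n vectors of size d; values: n vectors of size m.
    """
    d, m = len(keys[0]), len(values[0])
    kv = [[0.0] * m for _ in range(d)]  # sum_j phi(k_j) v_j^T, shape d x m
    k_sum = [0.0] * d                   # sum_j phi(k_j), shape d
    for k, v in zip(keys, values):
        phi_k = [feature_map(x) for x in k]
        for a in range(d):
            k_sum[a] += phi_k[a]
            for b in range(m):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        norm = sum(phi_q[a] * k_sum[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(d)) / norm for b in range(m)]
        )
    return outputs


def quadratic_attention(queries, keys, values):
    """Reference version: explicit n x n weights with the same kernel."""
    m = len(values[0])
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        weights = [
            sum(pq * feature_map(x) for pq, x in zip(phi_q, k)) for k in keys
        ]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(m)]
        )
    return outputs


if __name__ == "__main__":
    # Toy inputs: 3 tokens, key/query size 2, value size 2 (hypothetical data).
    Q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
    K = [[0.4, 0.0], [-0.3, 0.9], [0.2, -0.6]]
    V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
    fast = linear_attention(Q, K, V)
    slow = quadratic_attention(Q, K, V)
    print(all(abs(a - b) < 1e-9 for r, s in zip(fast, slow) for a, b in zip(r, s)))
```

The linear version stores only a d-by-m summary and a length-d normalizer, so its cost grows linearly with the number of tokens, whereas the reference version computes a weight for every query-key pair. Note that this kernel replaces the softmax rather than approximating it exactly, so its outputs differ from standard softmax attention.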