CL AI AS Jul 3, 2023

ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

Yujia Xiao, Shaofei Zhang, Xi Wang, Xu Tan, Lei He, Sheng Zhao, Frank K. Soong, Tan Lee

arXiv:2307.00782v21.39 citations h-index: 58

Originality: Incremental advance

AI Analysis

This work addresses natural and efficient long-form speech synthesis for applications like audiobooks or reading assistants. It is an incremental advance over existing sentence-level TTS methods.

The paper targets high-quality, expressive speech for paragraph reading in Text-to-Speech systems. It addresses two obstacles, missing cross-sentence context and the computational cost of long-form synthesis, and reports significant improvements in voice quality and prosody expressiveness with competitive model efficiency.

While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
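The abstract mentions linearized self-attention as the efficiency component but does not spell out its form. As a rough illustration only, the sketch below shows one common kernel-feature-map formulation of linear attention, which summarizes keys and values once and then reuses that summary for every query, so cost grows linearly with sequence length. The feature map (`elu(x) + 1`), the function names, and the toy inputs are all assumptions made for this example and are not taken from the ContextSpeech paper.

```python
import math


def feature_map(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Attention computed from a running key/value summary (linear in length)."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # sum over j of phi(k_j) v_j^T
    k_sum = [0.0] * d_k                     # sum over j of phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    outputs = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d_k))
        outputs.append(
            [sum(fq[a] * kv[a][b] for a in range(d_k)) / norm for b in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same weighting computed pairwise (quadratic in length), for comparison."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs


# Toy inputs, chosen arbitrarily for illustration.
queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[0.1, 0.4], [-0.7, 1.2], [2.0, -0.5]]
values = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [4.0, 1.0, 0.0]]

fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

Both functions produce the same outputs up to floating-point error. The linear version avoids forming the full query-by-key weight matrix, which is the property that matters when a paragraph is much longer than a single sentence.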
