SD AI AS · Feb 27, 2025

DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models

Weihao Wu, Zhiwei Lin, Yixuan Zhou, Jingbei Li, Rui Niu, Qinghua Wu, Songjun Cao, Long Ma, Zhiyong Wu

arXiv:2502.19924v1 · 9.32 citations · h-index: 10 · ICASSP

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating varied and natural conversational speech for applications like virtual assistants, though it appears incremental as it builds on existing diffusion and LM-based methods.

The paper addresses a limitation of existing conversational speech synthesis systems, which are deterministic and lack diversity. It proposes DiffCSS, a framework that combines diffusion models with a language model-based TTS backbone to generate diverse and expressive speech. In experiments, DiffCSS improves on existing CSS systems in diversity, contextual coherence, and expressiveness.

Conversational speech synthesis (CSS) aims to synthesize both contextually appropriate and expressive speech, and considerable efforts have been made to enhance the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, limiting the naturalness and quality of synthesized speech. To address these issues, in this paper, we propose DiffCSS, an innovative CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor is proposed to sample diverse prosody embeddings conditioned on multimodal conversational context. Then a prosody-controllable LM-based TTS backbone is developed to synthesize high-quality speech with sampled prosody embeddings. Experimental results demonstrate that the synthesized speech from DiffCSS is more diverse, contextually coherent, and expressive than existing CSS systems.
