Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
This work addresses the challenge of producing more natural-sounding conversational speech for applications such as virtual assistants, though the contribution is incremental because it builds on existing multimodal interaction modeling.
The paper tackles the problem of generating speech with natural prosody in conversational settings by modeling fine-grained, word-level interactions in the multimodal dialogue history. The proposed MFCIG-CSS system outperforms baseline models in prosodic expressiveness on the DailyTalk dataset.
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). Recent work predicts the prosodic expression of the target utterance by modeling utterance-level interactions between the MDH and the target utterance. However, the MDH also contains fine-grained semantic and prosodic knowledge at the word level, and existing methods overlook the modeling of these fine-grained semantic and prosodic interactions. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two graphs encode the interactions among word-level semantics and prosody, as well as their influence on subsequent utterances in the MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
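
To make the idea of a word-level interaction graph over the dialogue history more concrete, the following is a minimal illustrative sketch. It is not the construction used in MFCIG-CSS: the abstract does not specify node or edge definitions, so the `WordNode` type, the `build_interaction_graph` function, and the two edge types ("intra" for adjacent words in one utterance, "inter" for words in one utterance linked to words in the next) are all assumptions made here for illustration. Under these assumptions, the same toy topology could carry either text-derived or speech-derived word features to play the role of a semantic or a prosody graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordNode:
    """Hypothetical node type: one word in one utterance of the history."""
    utt: int   # index of the utterance within the dialogue history
    pos: int   # position of the word within that utterance
    text: str  # the word itself


def build_interaction_graph(utterances):
    """Build a toy word-level graph from a list of utterance strings.

    Assumed edge types (illustrative only, not taken from the paper):
      "intra": consecutive words inside the same utterance
      "inter": every word of utterance i to every word of utterance i + 1
    Returns (nodes, edges), where each edge is (source, target, edge_type).
    """
    # Group word nodes by utterance index, splitting on whitespace.
    by_utt = {}
    for u, utterance in enumerate(utterances):
        by_utt[u] = [WordNode(u, p, w) for p, w in enumerate(utterance.split())]

    nodes = [node for words in by_utt.values() for node in words]
    edges = []
    for u, words in by_utt.items():
        # Link adjacent words within the utterance.
        for a, b in zip(words, words[1:]):
            edges.append((a, b, "intra"))
        # Link each word to the words of the following utterance, if any.
        for a in words:
            for b in by_utt.get(u + 1, []):
                edges.append((a, b, "inter"))
    return nodes, edges
```

Calling `build_interaction_graph(["how are you", "fine thanks"])` returns one node per word plus the "intra" and "inter" edges between them. A real system would attach learned semantic or prosodic feature vectors to each node and encode the graph with a graph neural network, details for which the released code is the authoritative source.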