VisualSpeech: Enhancing Prosody Modeling in TTS Using Video
The paper addresses limited prosody variation in TTS for applications where video is available, but it appears incremental: it builds on existing methods by adding visual input.
The paper tackles the challenge of generating varied prosody in text-to-speech synthesis by integrating visual context from video. It proposes the VisualSpeech model and reports that incorporating visual features improves prosodic modeling and the expressiveness of the synthesized speech.
Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates visual and textual information for improving prosody generation in TTS. Empirical results indicate that incorporating visual features improves prosodic modeling, enhancing the expressiveness of the synthesized speech. Audio samples are available at https://ariameetgit.github.io/VISUALSPEECH-SAMPLES/.
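The abstract does not specify how visual and textual information are combined. As a purely illustrative sketch, one simple option is to pool per-frame visual features into a clip-level vector, concatenate it onto each phoneme-level text feature, and project the result to prosody values. Everything here is an assumption and not the VisualSpeech architecture: the function names (`mean_pool`, `fuse`, `predict_prosody`), the concatenation-based fusion, the linear projection, and the choice of pitch, energy, and duration as prosody targets.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average per-frame visual features into one clip-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def fuse(text_feats: List[Vector], visual_feats: List[Vector]) -> List[Vector]:
    """Append the pooled visual vector to every phoneme-level text vector.

    Hypothetical fusion by concatenation; the paper may use a different scheme.
    """
    pooled = mean_pool(visual_feats)
    return [t + pooled for t in text_feats]


def predict_prosody(
    fused: List[Vector], weights: List[Vector], bias: Vector
) -> List[Vector]:
    """Linearly project each fused vector to one value per prosody target.

    Each row of `weights` corresponds to one assumed target
    (for example pitch, energy, duration).
    """
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in fused
    ]


# Toy inputs: two phonemes with 2-dim text features, two video frames with 1-dim features.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[2.0], [4.0]]
fused = fuse(text_feats, visual_feats)

# Identity weights keep the example easy to check by hand.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]
prosody = predict_prosody(fused, weights, bias)
```

In this toy setup the same clip-level visual vector conditions every phoneme, so the same text paired with a different video would yield different prosody predictions, which is the effect the abstract describes at a high level.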