SD AI AS | Sep 24, 2024

Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

arXiv:2409.16203v1 | 2 citations | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the need for adaptive and emotionally nuanced speech synthesis for virtual characters and enhances accessibility for visually impaired users in contexts like webcomics.

The paper tackles the problem of synthesizing emotionally expressive speech from text by incorporating facial images and emotion intensity, proposing FEIM-TTS, a zero-shot TTS model that achieves high-quality, speaker-agnostic speech without labeled datasets.

We propose FEIM-TTS, a zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech, aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS goes beyond traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without depending on labeled datasets. To address the scarcity of audio-visual-emotional data, the model is trained on the LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. FEIM-TTS's ability to produce high-quality, speaker-agnostic speech makes it suitable for creating adaptable voices for virtual characters. Moreover, FEIM-TTS enhances accessibility for individuals with visual impairments. By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics, allowing visually impaired users to enjoy these narratives more fully. Comprehensive evaluation demonstrates its proficiency in modulating emotion and intensity, advancing emotional speech synthesis and accessibility. Samples are available at: https://feim-tts.github.io/.
