AS · CL · SD · Jun 21, 2023

Visual-Aware Text-to-Speech

arXiv:2306.12020v1 · 1 citation · h-index: 55
Originality: Incremental advance
AI Analysis

The work addresses the need for more adaptive and natural human-computer interaction. It is an incremental advance: it extends traditional text-to-speech with visual awareness.

The paper tackles the problem of synthesizing speech that responds to a listener's visual feedback in face-to-face interactions. Experiments on the ViCo-X dataset indicate that this yields more natural audio with appropriate rhythm and prosody.

Dynamically synthesizing speech that actively responds to a listening head is critical during face-to-face interaction. For example, the speaker could use the listener's facial expression to adjust tone, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task: synthesizing speech conditioned on both textual input and the sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Unlike traditional text-to-speech, VA-TTS highlights the impact of the visual modality. On this newly minted task, we devise a baseline model that fuses phoneme linguistic information with listener visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify that our proposal generates more natural audio with scenario-appropriate rhythm and prosody.
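
The abstract does not say how the baseline fuses phoneme and visual information, so the following is only an illustrative sketch of one simple option and not the authors' method. It assumes the listener's visual features arrive at a different frame rate than the phoneme sequence, resamples them to the phoneme length by linear interpolation, concatenates the two streams per phoneme, and applies a linear projection. The function names, feature dimensions, and random weights are all hypothetical.

```python
import numpy as np


def resample_to_length(features: np.ndarray, target_len: int) -> np.ndarray:
    """Linearly interpolate a (T, D) sequence along time to (target_len, D)."""
    src_len, dim = features.shape
    src_pos = np.linspace(0.0, 1.0, src_len)
    dst_pos = np.linspace(0.0, 1.0, target_len)
    # np.interp is 1-D, so interpolate each feature channel separately.
    columns = [np.interp(dst_pos, src_pos, features[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)


def fuse_phoneme_and_visual(
    phoneme_emb: np.ndarray,   # (num_phonemes, d_text), hypothetical embeddings
    visual_feats: np.ndarray,  # (num_frames, d_visual), hypothetical listener features
    weight: np.ndarray,        # (d_text + d_visual, d_out), hypothetical projection
) -> np.ndarray:
    """Assumed fusion: align in time, concatenate, then project linearly."""
    visual_aligned = resample_to_length(visual_feats, phoneme_emb.shape[0])
    fused = np.concatenate([phoneme_emb, visual_aligned], axis=1)
    return fused @ weight


# Toy inputs standing in for real encoder outputs.
rng = np.random.default_rng(0)
phoneme_emb = rng.normal(size=(12, 8))   # 12 phonemes, 8-dim text features
visual_feats = rng.normal(size=(30, 4))  # 30 video frames, 4-dim visual features
weight = rng.normal(size=(12, 8))        # projects 8 + 4 = 12 dims down to 8

fused = fuse_phoneme_and_visual(phoneme_emb, visual_feats, weight)
print(fused.shape)
```

In a real system the fused sequence would feed an acoustic decoder, and the fusion itself could equally be attention-based; concatenation is used here only because it is the simplest way to show text and visual conditioning entering the same representation.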

