SD, CV, AS | May 28, 2025

RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling

arXiv:2505.22024v1 | 1 citation | h-index: 3 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses accurate and natural speech synthesis from visual cues, with applications such as assistive technologies. The advance is incremental: it builds on existing methods and contributes a novel decomposition approach.

The paper tackles speech reconstruction from silent videos by proposing RESOUND, a lip-to-speech system that uses acoustic-semantic decomposed modeling to generate intelligible and expressive speech. Experiments on two standard benchmarks support its effectiveness across various metrics.

Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.
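
The abstract motivates the acoustic/semantic split by appeal to source-filter theory. As a toy illustration of that theory only (not of RESOUND's architecture, which the abstract does not specify in detail), the sketch below builds a signal from a pitch source passed through a single formant filter. All function names, the 8 kHz sample rate, and the 100 Hz bandwidth are assumptions chosen for the example.

```python
import math

def impulse_train(f0_hz, sample_rate, n_samples):
    """Source: one unit impulse per pitch period (a crude glottal-pulse stand-in)."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonator standing in for one vocal-tract formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius (< 1, stable)
    theta = 2.0 * math.pi * center_hz / sample_rate       # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2   # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0_hz, formant_hz, sample_rate=8000, n_samples=800):
    """Compose source and filter; each can be changed without touching the other."""
    source = impulse_train(f0_hz, sample_rate, n_samples)
    return resonator(source, formant_hz, 100.0, sample_rate)

# Same filter (formant), different source (pitch).
low_pitch = synthesize(f0_hz=100, formant_hz=500)
high_pitch = synthesize(f0_hz=200, formant_hz=500)
```

Changing `f0_hz` alters only the excitation, while changing `formant_hz` alters only the spectral shaping. That independence is the property the abstract leans on when it assigns prosody and linguistic content to separate paths.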

