SD CV LG MM AS Jul 1, 2025

MuteSwap: Visual-informed Silent Video Identity Conversion

arXiv:2507.00498v3 h-index: 1 MM
Originality Highly original
AI Analysis

This work addresses voice conversion in scenarios where audio is unavailable or noisy, such as silent videos, and offers a novel solution for applications in multimedia and communication.

The paper tackles the problem of performing voice conversion without audio input by using only visual cues from silent videos, achieving impressive performance in speech synthesis and identity conversion, especially in noisy conditions.

Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs: given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that matches the identity of the target speaker while preserving the speech content of the source silent video. Because this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimizes mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.
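
To make the idea of contrastive cross-modal identity alignment more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss in NumPy. It is an illustration under assumptions, not MuteSwap's actual objective: the abstract does not specify the loss form, so the function names (`symmetric_info_nce`, `l2_normalize`, `log_softmax`), the temperature value, and the premise that row i of the face embeddings and row i of the voice embeddings belong to the same speaker are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(face_emb, voice_emb, temperature=0.07):
    """Hypothetical contrastive loss between face and voice identity embeddings.

    Assumption: face_emb[i] and voice_emb[i] come from the same speaker
    (positive pair); every other pairing in the batch acts as a negative.
    The temperature of 0.07 is a common default, not a value from the paper.
    """
    f = l2_normalize(face_emb)
    v = l2_normalize(voice_emb)
    logits = f @ v.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(f))
    # Face-to-voice direction: each row should pick out its own column.
    loss_f2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Voice-to-face direction: each column should pick out its own row.
    loss_v2f = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_f2v + loss_v2f)
```

Under these assumptions, a batch whose face and voice embeddings are correctly paired yields a lower loss than the same batch with the voice rows shuffled, which is the signal that would pull matching identities together across modalities during training.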
