CV · Nov 7, 2025

Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

arXiv:2511.05432v1 · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses audio-visual synthesis for applications such as virtual avatars, but the advance is incremental because it builds on existing TTS and face generation techniques.

The paper tackles the problem of generating synchronized talking faces from text by using a shared latent speech representation, achieving improved lip-sync and visual realism compared to cascaded methods.

We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training scheme: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
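The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `JointSynthesizer`, `conditioning_for_stage`, the toy stand-in functions, and the `HOP` constant are all hypothetical names chosen for illustration, and the "latent" is a plain list of frames rather than real Wav2Vec2 embeddings. The sketch only shows the two structural ideas: one predicted latent sequence feeds both the speech and the face branch, and the source of conditioning features switches between the two training stages.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for a sequence of Wav2Vec2-like embeddings: T frames x D dims.
Latent = List[List[float]]

HOP = 4  # hypothetical number of audio samples produced per latent frame


@dataclass
class JointSynthesizer:
    """Hypothetical wrapper: one shared latent conditions both branches."""
    text_to_vec: Callable[[str], Latent]
    speech_decoder: Callable[[Latent], List[float]]
    face_decoder: Callable[[Latent], List[List[float]]]

    def synthesize(self, text: str):
        # The latent is predicted once from text and reused by both decoders,
        # so speech and facial motion share the same frame timeline.
        latent = self.text_to_vec(text)
        return self.speech_decoder(latent), self.face_decoder(latent)


def conditioning_for_stage(stage: str, text: str, audio_latent: Latent,
                           text_to_vec: Callable[[str], Latent]) -> Latent:
    """Pick the conditioning features for a training stage (assumed schedule).

    "pretrain": embeddings extracted from real audio (clean features).
    "finetune": embeddings predicted from text (TTS-predicted features).
    """
    if stage == "pretrain":
        return audio_latent
    if stage == "finetune":
        return text_to_vec(text)
    raise ValueError(f"unknown stage: {stage}")


# Toy stand-ins so the sketch runs end to end; real modules would be networks.
def toy_text_to_vec(text: str) -> Latent:
    return [[float(ord(c) % 7), 1.0] for c in text]  # one frame per character


def toy_speech_decoder(latent: Latent) -> List[float]:
    return [frame[0] for frame in latent for _ in range(HOP)]


def toy_face_decoder(latent: Latent) -> List[List[float]]:
    return [[frame[0] * 0.1] for frame in latent]  # one motion vector per frame


model = JointSynthesizer(toy_text_to_vec, toy_speech_decoder, toy_face_decoder)
audio, motion = model.synthesize("hello")
```

Because both decoders consume the same latent sequence, the audio length and the number of motion frames are tied together by construction, which is the alignment property a cascaded pipeline has to recover after the fact.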
