AS, AI, CL, CV, MM, SD | Nov 23, 2025

SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

arXiv:2512.05126v1
Originality: Incremental advance
AI Analysis

The paper addresses video dubbing challenges for multimedia applications, but the advance appears incremental because it builds on existing pretrained TTS models.

The paper tackles video dubbing by proposing SyncVoice, a framework that fine-tunes a pretrained TTS model with audio-visual data to improve speech naturalness and synchronization, achieving high-fidelity results in cross-lingual settings.

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still fall short in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We also propose a Dual Speaker Encoder to mitigate inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.

