AS AI CL CV MM SD
Nov 23, 2025

SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong

arXiv:2512.05126v11.2

Originality: Incremental advance

AI Analysis

This work addresses video dubbing challenges for multimedia applications, but it appears incremental because it builds on existing TTS models.

The paper tackles video dubbing by proposing SyncVoice, a framework that fine-tunes a pretrained TTS model with audio-visual data to improve speech naturalness and synchronization, achieving high-fidelity results in cross-lingual settings.

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audio-visual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis, and we explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.
