SDCL · Aug 12, 2025

Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization

arXiv:2508.08550v1 · 7 citations · h-index: 1 · ACL
Originality: Synthesis-oriented
AI Analysis

This work addresses audio-video synchronization problems that viewers encounter in dubbed video. It is an incremental improvement in a domain-specific area.

The paper tackles audio-video synchronization issues in video dubbing that arise when the target speech does not match the source speech in duration. It proposes the Segment Supervised Preference Optimization (SSPO) method, which the authors report achieves superior performance on duration alignment tasks.

Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
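To make the idea of treating duration alignment as a preference optimization problem more concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not specify the sampling procedure or the loss, so this sketch swaps in a standard DPO-style pairwise loss applied to a single dialogue line. The function names, the relative-mismatch metric, the "closest versus farthest" pairing rule, and the `beta` value are all assumptions made for illustration.

```python
import math


def duration_mismatch(source_sec, target_sec):
    """Relative duration gap between a source line and a candidate dub (assumed metric)."""
    return abs(target_sec - source_sec) / source_sec


def pick_preference_pair(source_sec, candidates):
    """Pick a (chosen, rejected) pair for one line.

    candidates: list of (text, estimated_duration_sec) tuples, e.g. several
    translations sampled for the same line. The pairing rule here (best
    aligned vs. worst aligned) is a hypothetical choice, not the paper's.
    """
    ranked = sorted(candidates, key=lambda c: duration_mismatch(source_sec, c[1]))
    return ranked[0], ranked[-1]


def dpo_segment_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO pairwise loss for one segment: -log(sigmoid(margin)).

    logp_* are log-probabilities under the policy being trained and ref_*
    are log-probabilities under a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # One source line lasting 2.0 s and three hypothetical sampled translations.
    samples = [("short", 1.0), ("aligned", 2.1), ("long", 3.5)]
    chosen, rejected = pick_preference_pair(2.0, samples)
    print(chosen, rejected)
    print(dpo_segment_loss(-4.0, -6.0, -5.0, -5.0))
```

Scoring each line separately, rather than the whole translated dialogue at once, is one way to read "segment-wise" and "fine-grained" in the abstract; the paper itself should be consulted for the actual formulation.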
