CL AI SD AS | May 31, 2025

Length Aware Speech Translation for Video Dubbing

Harveen Singh Chadha, Aswin Shanmugam Subramanian, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li

arXiv:2506.00740v1 | 2.71 citations | h-index: 15 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This paper addresses synchronization issues in real-time, on-device video dubbing, offering an incremental improvement over existing methods.

The paper tackles the challenge of aligning translated audio with source audio in video dubbing. It develops a phoneme-based, end-to-end length-sensitive speech translation model and a length-aware beam search, which maintain comparable BLEU scores while improving synchronization quality, with MOS gains of 0.34 for Spanish and 0.65 for Korean.

In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.
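As a rough illustration of the length-tag idea in the abstract, the sketch below assigns a short, normal, or long tag from the ratio of target to source phoneme counts, then picks the length variant closest to the source. It is not the paper's implementation: the tag strings, the ratio thresholds, and the function names `length_tag` and `pick_candidate` are all hypothetical, and the selection step is only one plausible way to use several length variants for synchronization, not LABS itself.

```python
from typing import Dict

# Hypothetical thresholds: the abstract does not state the actual values.
SHORT_MAX_RATIO = 0.9
LONG_MIN_RATIO = 1.1


def length_tag(source_phonemes: int, target_phonemes: int) -> str:
    """Map a target/source phoneme-count ratio to an assumed length tag."""
    if source_phonemes <= 0:
        raise ValueError("source_phonemes must be positive")
    ratio = target_phonemes / source_phonemes
    if ratio < SHORT_MAX_RATIO:
        return "<short>"
    if ratio > LONG_MIN_RATIO:
        return "<long>"
    return "<normal>"


def pick_candidate(source_phonemes: int, candidates: Dict[str, int]) -> str:
    """Return the tag whose candidate phoneme count is closest to the source.

    `candidates` maps a length tag to the phoneme count of the translation
    generated under that tag (an assumed interface for this sketch).
    """
    return min(candidates, key=lambda tag: abs(candidates[tag] - source_phonemes))
```

For example, with a 50-phoneme source and candidates of 40, 52, and 65 phonemes for the three tags, `pick_candidate` would choose the normal-length variant, since its length is nearest to the source.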
