CL · AI · SD · AS · May 31, 2025

Length Aware Speech Translation for Video Dubbing

arXiv:2506.00740v1 · 1 citation · h-index: 15 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This paper addresses audio synchronization in real-time, on-device video dubbing and offers an incremental improvement over existing methods.

The paper tackles the challenge of aligning translated audio with source audio in video dubbing. It develops a phoneme-based, end-to-end length-sensitive speech translation model and a length-aware beam search, which together maintain BLEU scores comparable to a baseline while improving synchronization quality, with MOS gains of 0.34 for Spanish and 0.65 for Korean.

In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving a mean opinion score (MOS) gain of 0.34 for Spanish and 0.65 for Korean.
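
The length-tag idea can be illustrated with a minimal sketch. This is not the authors' implementation: the tag strings (`<short>`, `<normal>`, `<long>`), the ratio thresholds, and the helper names `length_tag` and `pick_closest` are all assumptions made for illustration, since the abstract does not specify them. The sketch labels a training pair by its target-to-source phoneme-count ratio, then picks, among three candidate translations, the one whose phoneme count is closest to the source. It does not reproduce LABS itself, which produces the candidates in a single decoding pass.

```python
from typing import Dict, List

# Hypothetical thresholds on the target/source phoneme-count ratio.
# The abstract does not state the values the authors used.
SHORT_MAX = 0.9
LONG_MIN = 1.1


def length_tag(src_phonemes: List[str], tgt_phonemes: List[str]) -> str:
    """Label a training pair by its target-to-source phoneme-count ratio."""
    if not src_phonemes:
        raise ValueError("source phoneme sequence must be non-empty")
    ratio = len(tgt_phonemes) / len(src_phonemes)
    if ratio < SHORT_MAX:
        return "<short>"
    if ratio > LONG_MIN:
        return "<long>"
    return "<normal>"


def pick_closest(src_len: int, candidates: Dict[str, List[str]]) -> str:
    """Return the tag whose candidate phoneme count is closest to src_len."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return min(candidates, key=lambda tag: abs(len(candidates[tag]) - src_len))


# Usage: three candidate translations (as phoneme lists), one per tag.
candidates = {
    "<short>": ["p"] * 7,
    "<normal>": ["p"] * 11,
    "<long>": ["p"] * 14,
}
best = pick_closest(10, candidates)
```

Phoneme counts stand in for spoken duration here, which is a simplification; a real system would likely compare estimated or synthesized audio durations.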
