CL, SD, AS | May 27, 2025

Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

arXiv:2505.20899v1 | 1 citation | h-index: 9 | EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the need for seamless dubbing in media applications by overcoming mismatches in speech patterns between source and translated speech, though it is incremental, as it builds on existing translation approaches.

The paper tackles speech-to-speech translation for dubbing by preserving duration, speaker identity, and speaking speed, producing natural and fluent translations that align with the original speech's characteristics while achieving competitive translation performance.

This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech's duration and speaking pace, while achieving competitive translation performance.

