CL · Feb 25, 2023
Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing
Alexandra Chronopoulou, Brian Thompson, Prashant Mathur et al. · amazon-science, apple-ml
Automatic dubbing (AD) is the task of translating the original speech in a video into target-language speech. The new target-language speech should satisfy isochrony; that is, it should be time-aligned with the original video, including mouth movements, pauses, hand gestures, etc. In this paper, we propose training a model that directly optimizes both the translation and the speech duration of the generated translations. We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
CL · Aug 4, 2023
Speaker Diarization of Scripted Audiovisual Content
Yogesh Virkar, Brian Thompson, Rohit Paturi et al. · amazon-science, apple-ml
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. the as-broadcast script) must be structured into a sequence of dialogue lines, each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, and (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach that leverages production scripts used during the shooting process to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements of 51.7% relative to two unsupervised baseline models on our metrics on a 66-show test set.
CL · Dec 23, 2022
Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing
William Brannon, Yogesh Virkar, Brian Thompson · amazon-science, apple-ml
We investigate how humans perform the task of dubbing video content from one language into another, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles. This is the first such large-scale study we are aware of. The results challenge a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automatic dubbing, arguing for the importance of vocal naturalness and translation quality over commonly emphasized isometric (character length) and lip-sync constraints, and for a more qualified view of the importance of isochronic (timing) constraints. We also find substantial influence of the source-side audio on human dubs through channels other than the words of the translation, pointing to the need for research on ways to preserve speech characteristics, as well as semantic transfer such as emphasis/emotion, in automatic dubbing systems.
CL · Apr 6, 2022
Prosodic Alignment for off-screen automatic dubbing
Yogesh Virkar, Marcello Federico, Robert Enyedi et al. · amazon-science
The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence. This entails isochrony, i.e., translating the original speech by also matching its prosodic structure into phrases and pauses, especially when the speaker's mouth is visible. In previous work, we introduced a prosodic alignment model to address isochrone or on-screen dubbing. In this work, we extend the prosodic alignment model to also address off-screen dubbing, which requires less stringent synchronization constraints. We conduct experiments on four dubbing directions (English to French, Italian, German and Spanish) on a publicly available collection of TED Talks and on publicly available YouTube videos. Empirical results show that, compared to our previous work, the extended prosodic alignment model provides a significantly better subjective viewing experience on videos in which on-screen and off-screen automatic dubbing is applied for sentences with the speaker's mouth visible and not visible, respectively.
CL · May 22, 2023
Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters
Proyag Pal, Brian Thompson, Yogesh Virkar et al.
To translate speech for automatic dubbing, machine translation needs to be isochronous, i.e. translated speech needs to be aligned with the source in terms of speech durations. We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences. We also introduce auxiliary counters to help the decoder to keep track of the timing information while generating target phonemes. We show that our model improves translation quality and isochrony compared to previous work where the translation model is instead trained to predict interleaved sequences of phonemes and durations.
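The contrast between the two output representations mentioned above can be sketched roughly as follows. This is an illustration only, not the paper's implementation: the function names, the phoneme symbols, the frame-based duration unit, and the "remaining frames" counter are all assumptions, since the abstract does not specify how factors or counters are encoded.

```python
from typing import List, Sequence, Tuple


def interleave(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Single output stream: phoneme, duration, phoneme, duration, ..."""
    if len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    stream: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        stream.extend([phoneme, str(duration)])
    return stream


def to_factors(phonemes: Sequence[str], durations: Sequence[int]) -> Tuple[List[str], List[int]]:
    """Two parallel streams of equal length: one duration factor per phoneme."""
    if len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    return list(phonemes), list(durations)


def remaining_counter(durations: Sequence[int], total: int) -> List[int]:
    """Hypothetical auxiliary counter: frames still available before each phoneme."""
    remaining: List[int] = []
    left = total
    for duration in durations:
        remaining.append(left)
        left -= duration
    return remaining


phonemes = ["HH", "AH", "L", "OW"]   # illustrative phoneme symbols
durations = [3, 5, 4, 6]             # illustrative durations, in frames

print(interleave(phonemes, durations))
print(to_factors(phonemes, durations))
print(remaining_counter(durations, total=sum(durations)))
```

In the factored form the phoneme sequence is half as long as the interleaved stream, and a counter of this kind gives the decoder an explicit view of how much of the time budget is left at each step.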
CL · Dec 16, 2021
Isometric MT: Neural Machine Translation for Automatic Dubbing
Surafel M. Lakew, Yogesh Virkar, Prashant Mathur et al.
Automatic dubbing (AD) is among the machine translation (MT) use cases where translations should match a given length to allow for synchronicity between source and target speech. For neural MT, generating translations of length close to the source length (e.g. within ±10% in character count) while preserving quality is a challenging task. Controlling MT output length comes at a cost to translation quality, which is usually mitigated with a two-step approach of generating N-best hypotheses and then re-ranking based on length and quality. This work introduces a self-learning approach that allows a transformer model to directly learn to generate outputs that closely match the source length, in short Isometric MT. In particular, our approach requires neither generating multiple hypotheses nor any auxiliary ranking function. We report results on four language pairs (English - French, Italian, German, Spanish) with a publicly available benchmark. Automatic and manual evaluations show that our method for Isometric MT outperforms more complex approaches proposed in the literature.
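As a rough illustration of the length criterion and of the two-step baseline described above (not of the paper's self-learning method), the sketch below checks whether a translation falls within a character-count tolerance of the source and re-ranks N-best hypotheses by a combined score. The function names, the linear score combination, and the `length_weight` parameter are assumptions made for this example.

```python
from typing import List, Tuple


def is_isometric(source: str, target: str, tolerance: float = 0.10) -> bool:
    """True if the target character count is within +/- tolerance of the source."""
    if not source:
        raise ValueError("source must be non-empty")
    return abs(len(target) - len(source)) <= tolerance * len(source)


def rerank(source: str, hypotheses: List[Tuple[str, float]], length_weight: float = 1.0) -> str:
    """Pick the hypothesis with the best quality score after a length penalty.

    hypotheses: (text, model_score) pairs, where higher scores are better.
    The linear combination is an assumed stand-in for a ranking function.
    """
    if not source or not hypotheses:
        raise ValueError("need a non-empty source and at least one hypothesis")

    def combined(hypothesis: Tuple[str, float]) -> float:
        text, score = hypothesis
        length_deviation = abs(len(text) / len(source) - 1.0)
        return score - length_weight * length_deviation

    return max(hypotheses, key=combined)[0]


source = "a" * 20
n_best = [("b" * 30, -0.5), ("c" * 21, -0.7)]  # made-up scores for illustration
print(is_isometric(source, n_best[1][0]))       # 21 vs 20 characters
print(rerank(source, n_best))
```

With `length_weight=0.0` the re-ranker falls back to the raw model score, which makes the quality/length trade-off of the two-step approach explicit.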
CL · Dec 16, 2021
Isochrony-Aware Neural Machine Translation for Automatic Dubbing
Derek Tam, Surafel M. Lakew, Yogesh Virkar et al.
We introduce the task of isochrony-aware machine translation, which aims at generating translations suitable for dubbing. Dubbing of a spoken sentence requires transferring the content as well as the speech-pause structure of the source into the target language to achieve audiovisual coherence. Practically, this implies correctly projecting pauses from the source to the target and ensuring that target speech segments have roughly the same duration as the corresponding source speech segments. In this work, we propose implicit and explicit modeling approaches to integrate isochrony information into neural machine translation. Experiments on English-German/French language pairs with automatic metrics show that the simplest of the considered approaches works best. Results are confirmed by human evaluations of translations and dubbed videos.
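The requirement that target segments have roughly the same duration as their source segments can be made concrete with a small check like the one below. It is a hypothetical measure written for illustration, not a metric taken from the paper: the segment format (start and end times in seconds), the 20% tolerance, and the function names are all assumptions.

```python
from typing import List, Sequence, Tuple


def segment_durations(segments: Sequence[Tuple[float, float]]) -> List[float]:
    """Durations of (start, end) speech segments; pauses are the gaps between them."""
    return [end - start for start, end in segments]


def duration_compliance(
    source_segments: Sequence[Tuple[float, float]],
    target_durations: Sequence[float],
    tolerance: float = 0.2,
) -> float:
    """Fraction of segments whose target duration is within tolerance of the source."""
    source_durations = segment_durations(source_segments)
    if not source_durations or len(source_durations) != len(target_durations):
        raise ValueError("need one target duration per source segment")
    within = sum(
        1
        for src, tgt in zip(source_durations, target_durations)
        if abs(tgt - src) <= tolerance * src
    )
    return within / len(source_durations)


# Two source speech segments separated by a 0.5 s pause (made-up timings).
source_segments = [(0.0, 2.0), (2.5, 4.5)]
target_durations = [2.1, 3.0]  # synthesized durations for the translated segments
print(duration_compliance(source_segments, target_durations))
```

Because the check is applied per segment, it only makes sense once the pauses have been projected so that source and target have the same number of segments.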
CL · Oct 8, 2021
Machine Translation Verbosity Control for Automatic Dubbing
Surafel M. Lakew, Marcello Federico, Yue Wang et al.
Automatic dubbing aims at seamlessly replacing the speech in a video document with synthetic speech in a different language. The task implies many challenges, one of which is generating translations that not only convey the original content, but also match the duration of the corresponding utterances. In this paper, we focus on the problem of controlling the verbosity of machine translation output, so that subsequent steps of our automatic dubbing pipeline can generate dubs of better quality. We propose new methods to control the verbosity of MT output and compare them against the state of the art with both intrinsic and extrinsic evaluations. For our experiments we use a public data set to dub English speeches into French, Italian, German and Spanish. Finally, we report extensive subjective tests that measure the impact of MT verbosity control on the final quality of dubbed video clips.