CL, Feb 25, 2020

MuST-Cinema: a Speech-to-Subtitles corpus

Alina Karakanta, Matteo Negri, Marco Turchi

arXiv:2002.10829v1

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of automating subtitling for audiovisual content localization, with the aim of supporting human subtitlers and reducing costs. The contribution is incremental, as it builds on existing data and methods.

The authors address the lack of speech-aligned, subtitle-break-annotated data for automatic subtitling by creating MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus consists of (audio, transcription, translation) triplets in which subtitle breaks are preserved as special symbols, so that models trained on it can learn to segment sentences into subtitles.

Growing needs in localising audiovisual content in multiple languages through subtitles call for the development of automatic solutions for human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, are missing both alignments to the source language audio and important information about subtitle breaks. This poses a significant limitation for developing efficient automatic approaches for subtitling, since the length and form of a subtitle directly depend on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus comprises (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles and propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the constraint of length.
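To make the idea of break symbols and a length constraint concrete, here is a minimal sketch in Python. It is not the authors' segmentation method, which relies on trained models. It is a simple greedy baseline that packs words into lines and marks the breaks inline. The marker names `<eol>` (line break within a subtitle) and `<eob>` (end of a subtitle block), the 42-character line limit, and the two-lines-per-block limit are assumptions made for illustration and are not stated in the abstract above. The helper names `insert_breaks` and `conforms` are hypothetical.

```python
EOL = "<eol>"  # assumed marker: line break inside a subtitle block
EOB = "<eob>"  # assumed marker: end of a subtitle block


def insert_breaks(sentence, max_chars=42, max_lines=2):
    """Greedily pack words into lines of at most max_chars characters,
    closing a subtitle block after max_lines lines.

    A single word longer than max_chars is kept on its own line,
    so the result can still violate the limit in that case.
    """
    lines, current = [], ""
    for word in sentence.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)

    out = []
    for i, line in enumerate(lines):
        # Close the block on every max_lines-th line and on the final line.
        last_in_block = (i + 1) % max_lines == 0 or i == len(lines) - 1
        out.append(line)
        out.append(EOB if last_in_block else EOL)
    return " ".join(out)


def conforms(annotated, max_chars=42):
    """Return True if every line between break markers fits the length limit."""
    for block in annotated.split(EOB):
        for line in block.split(EOL):
            if len(line.strip()) > max_chars:
                return False
    return True
```

Calling `insert_breaks` on a plain sentence yields the same sentence with the assumed markers inserted inline, and `conforms` can then check an annotated string against the length constraint. A purely length-based rule like this ignores syntax and timing, which is the gap that models trained on break-annotated data are meant to close.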

