CL, AI · Jun 28, 2022

On the Impact of Noises in Crowd-Sourced Data for Speech Translation

ByteDance, CMU
arXiv:2206.13756v2 · 640 citations · h-index: 51
Originality: Synthesis-oriented
AI Analysis

This work addresses data-quality problems in MuST-C, a widely used speech translation benchmark. The topic matters to researchers and practitioners in the field, though the contribution is incremental because it focuses on cleaning an existing dataset.

The paper tackles quality issues in the MuST-C speech translation dataset, such as audio-text misalignment and inaccurate translations, by proposing an automatic method to fix or filter them. It reports that models score better on the cleaned test sets and that model rankings stay consistent across different test sets.

Training speech translation (ST) models requires large and high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of the eight translation directions. This dataset passed several quality-control filters during creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translation, and unnecessary speaker names. What impact do these data-quality issues have on model development and evaluation? In this paper, we propose an automatic method to fix or filter the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and the rank of proposed models remains consistent across different test sets. In addition, simply removing misaligned data points from the training set does not lead to a better ST model.
