CL · AI · Mar 20, 2022

STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation

ByteDance · CMU
arXiv:2203.10426v1671 citations · h-index: 60
Originality: Incremental advance
AI Analysis

This work addresses the cross-modal representation discrepancy between speech and text in speech translation. It is an incremental advance for the speech-to-text translation domain.

The paper tackles the problem of learning better speech representations for end-to-end speech-to-text translation when labeled data is limited. It proposes the STEMM method to calibrate the cross-modal representation discrepancy, and reports significant improvements over a strong baseline on eight translation directions of the MuST-C benchmark.

How to learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy. Specifically, we mix up the representation sequences of different modalities, feed both unimodal speech sequences and multimodal mixed sequences to the translation model in parallel, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
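The two ingredients named in the abstract, mixing speech and text representation sequences and regularizing the predictions from the speech-only and mixed inputs, can be illustrated with a small sketch. This is not the authors' implementation. The word-level alignment between speech frames and text tokens, the mixing probability `p`, the Jensen-Shannon consistency term, and all function names (`mix_sequences`, `consistency_loss`, and so on) are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def mix_sequences(speech_segments, text_embeddings, p, rng):
    """Build a mixed-modality sequence (illustrative assumption, not the paper's code).

    speech_segments: one entry per word, each a list of frame vectors.
    text_embeddings: one entry per word, each a single embedding vector.
    For each word, the text embedding replaces the speech frames with
    probability p; otherwise the speech frames are kept.
    """
    if len(speech_segments) != len(text_embeddings):
        raise ValueError("need exactly one speech segment per word")
    mixed = []
    for frames, word_vec in zip(speech_segments, text_embeddings):
        if rng.random() < p:
            mixed.append(list(word_vec))           # take the text representation
        else:
            mixed.extend(list(f) for f in frames)  # keep the speech frames
    return mixed


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence, a symmetric consistency measure."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def consistency_loss(speech_logits, mixed_logits):
    """Average JS divergence between per-position output distributions
    produced from the speech-only input and from the mixed input."""
    if not speech_logits or len(speech_logits) != len(mixed_logits):
        raise ValueError("need equally many, non-empty target positions")
    total = 0.0
    for a, b in zip(speech_logits, mixed_logits):
        total += js_divergence(softmax(a), softmax(b))
    return total / len(speech_logits)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two words: the first spans two speech frames, the second spans one.
    speech = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
    text = [[1.0, 1.0], [2.0, 2.0]]
    mixed = mix_sequences(speech, text, p=0.5, rng=rng)
    print(len(mixed), "vectors in the mixed sequence")
    # Hypothetical decoder logits for two target positions from each input.
    loss = consistency_loss([[2.0, 0.5], [0.1, 1.2]], [[1.5, 0.7], [0.3, 1.0]])
    print("consistency loss:", loss)
```

In a training loop, a term like this would typically be added to the usual translation losses for both inputs. The relative weighting of the terms is a further design choice that the abstract does not state.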

Code Implementations: 1 repo
