CL Apr 21, 2021

End-to-end Speech Translation via Cross-modal Progressive Training

arXiv:2104.10380v25.881 citations Has Code

Originality: Incremental advance

AI Analysis

This paper addresses data scarcity in end-to-end speech translation, offering researchers and practitioners an incremental improvement over existing methods.

The paper tackles data scarcity in end-to-end speech translation by proposing XSTNet, which combines a self-supervised pre-trained audio encoder, multi-task training on additional parallel bilingual text, and a progressive training procedure. It reports state-of-the-art results with an average BLEU of 28.8, outperforming the previous best method by 3.2 BLEU.

End-to-end speech translation models have become a new trend in research due to their potential to reduce error propagation. However, these models still suffer from data scarcity. Effectively using unlabeled data or parallel corpora from machine translation is promising but remains an open problem. In this paper, we propose Cross Speech-Text Network (XSTNet), an end-to-end model for speech-to-text translation. XSTNet takes both speech and text as input and outputs both transcription and translation text. The model benefits from three key design aspects: a self-supervised pre-trained sub-network as the audio encoder, a multi-task training objective to exploit additional parallel bilingual text, and a progressive training procedure. We evaluate XSTNet and baselines on the MuST-C En-X and LibriSpeech En-Fr datasets. In particular, XSTNet achieves state-of-the-art results on all language directions with an average BLEU of 28.8, outperforming the previous best method by 3.2 BLEU. Code, models, cases, and more detailed analysis are available at https://github.com/ReneeYe/XSTNet.
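The multi-task objective and progressive procedure mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the task names (`st`, `asr`, `mt`), the two-stage schedule, the `mt_pretrain_steps` parameter, and the equal loss weights are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, Tuple

# Hypothetical task names: speech->translation, speech->transcript, text->translation.
ALL_TASKS: Tuple[str, ...] = ("st", "asr", "mt")


def active_tasks(step: int, mt_pretrain_steps: int) -> Tuple[str, ...]:
    """Return the tasks trained at `step` under an assumed two-stage schedule."""
    if step < mt_pretrain_steps:
        # Stage 1 (assumption): text-only MT training on extra parallel text.
        return ("mt",)
    # Stage 2 (assumption): joint multi-task training on all three tasks.
    return ALL_TASKS


def multitask_loss(
    task_losses: Dict[str, float], step: int, mt_pretrain_steps: int
) -> float:
    """Sum the losses of the tasks active at this step (equal weights assumed)."""
    return sum(task_losses[t] for t in active_tasks(step, mt_pretrain_steps))


if __name__ == "__main__":
    # Made-up per-task loss values, for illustration only.
    losses = {"st": 2.0, "asr": 1.5, "mt": 1.0}
    print(multitask_loss(losses, step=10, mt_pretrain_steps=100))
    print(multitask_loss(losses, step=500, mt_pretrain_steps=100))
```

In this sketch the early steps optimize only the text translation loss, and later steps optimize the sum of all three, which is one simple way to read "progressive"; the schedule and weighting actually used are described in the paper and the linked repository.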
