CL · SD · AS · Jun 2, 2023

Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23

arXiv:2306.01327v1222 citations · h-index: 35
Originality: Incremental advance
AI Analysis

This work addresses speech-to-text translation for multilingual applications, presenting an incremental improvement with specific gains on benchmark datasets.

The paper tackles speech translation by using foundation models and optimal transport to adapt speech representations to text model spaces, achieving BLEU scores of 31.2 on MuST-C tst-COMMON, 29.8 on IWSLT.tst2020, and 33.4 on IWSLT.ACLdev2023.
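To make the optimal transport idea concrete, here is a minimal sketch of an entropy-regularized OT (Sinkhorn) cost between a sequence of speech encoder states and a sequence of text encoder states. It is a generic illustration, not the authors' implementation: the function name `sinkhorn_ot_loss`, the squared Euclidean cost, the uniform masses, and the values of `epsilon` and `n_iters` are all assumptions chosen for readability.

```python
import math


def sinkhorn_ot_loss(speech, text, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT cost between two sequences of vectors.

    Hypothetical helper: uniform mass on every position, squared
    Euclidean ground cost. Returns (transport cost, transport plan).
    """
    n, m = len(speech), len(text)
    # Pairwise squared Euclidean cost between speech and text states.
    cost = [[sum((a - b) ** 2 for a, b in zip(s, t)) for t in text]
            for s in speech]
    # Scale the cost by its maximum before exponentiating, so the
    # Gibbs kernel cannot underflow to zero for large distances.
    max_cost = max(max(row) for row in cost) or 1.0
    kernel = [[math.exp(-(c / max_cost) / epsilon) for c in row]
              for row in cost]
    a = [1.0 / n] * n  # uniform mass over speech positions
    b = [1.0 / m] * m  # uniform mass over text positions
    u = [1.0] * n
    v = [1.0] * m
    # Sinkhorn iterations: alternately rescale rows and columns so the
    # plan's marginals approach a and b.
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)]
            for i in range(n)]
    # Report the transport cost in the original (unscaled) cost units.
    loss = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return loss, plan
```

Minimizing a cost of this kind with respect to the speech encoder pulls its states toward the text encoder's states without requiring the two sequences to have the same length, which is why OT suits speech-to-text alignment. A training setup would compute it on tensors with automatic differentiation rather than on Python lists.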

This paper describes the submission of the UPC Machine Translation group to the IWSLT 2023 Offline Speech Translation task. Our Speech Translation systems utilize foundation models for speech (wav2vec 2.0) and text (mBART50). We incorporate a Siamese pretraining step of the speech and text encoders with CTC and Optimal Transport, to adapt the speech representations to the space of the text model, thus maximizing transfer learning from MT. After this pretraining, we fine-tune our system end-to-end on ST, with Cross Entropy and Knowledge Distillation. Apart from the available ST corpora, we create synthetic data with SegAugment to better adapt our models to the custom segmentations of the IWSLT test sets. Our best single model obtains 31.2 BLEU points on MuST-C tst-COMMON, 29.8 points on IWSLT.tst2020 and 33.4 points on the newly released IWSLT.ACLdev2023.
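The fine-tuning objective named in the abstract, Cross Entropy combined with Knowledge Distillation, can be sketched for a single target token as follows. This is an assumed word-level formulation: the helper names, the linear interpolation, and the `kd_weight` parameter are illustrative and are not taken from the paper, and the abstract does not say which model supplies the teacher distribution.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def ce_kd_loss(student_logits, teacher_logits, target_index, kd_weight=0.5):
    """Interpolate cross entropy with a distillation term for one token.

    Hypothetical helper: kd_weight=0 gives plain cross entropy against
    the reference token, kd_weight=1 gives pure distillation.
    """
    student_logp = log_softmax(student_logits)
    teacher_p = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    # Cross entropy against the one-hot reference token.
    ce = -student_logp[target_index]
    # Distillation: cross entropy against the teacher's full distribution.
    kd = -sum(p * lp for p, lp in zip(teacher_p, student_logp))
    return (1.0 - kd_weight) * ce + kd_weight * kd
```

For example, `ce_kd_loss([2.0, 0.5, -1.0], [1.5, 1.0, -2.0], target_index=0)` scores a three-word vocabulary step; in practice the loss would be averaged over all target positions in a batch.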

