SD · CL · AS · Oct 31, 2022

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Microsoft
arXiv:2210.17027v1 · 18 citations · h-index: 102
Originality: Highly original
AI Analysis

This work addresses data scarcity for researchers and practitioners in speech-to-speech translation, offering an incremental improvement over existing methods.

The paper tackles data scarcity in direct speech-to-speech translation by proposing a model jointly pre-trained on unpaired speech and bilingual text. It reports an improvement of about 5 BLEU points over encoder-only pre-training models and performance competitive with state-of-the-art models.

Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S gets an improvement of about 5 BLEU scores compared to encoder-only pre-training models, and achieves a competitive or even better performance than existing state-of-the-art models.

Code Implementations: 1 repo
