CL | SD | AS | May 24, 2023

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

arXiv:2305.14838v2 | 21 citations
Originality: Incremental advance
AI Analysis

This work addresses data and efficiency bottlenecks in end-to-end speech-to-text translation, offering a practical solution for multilingual applications.

The paper tackles the challenge of joint speech-language training by introducing ComSL, a composite model built from public pretrained speech-only and language-only models. It reports a new state-of-the-art average BLEU score of 31.5 on the CoVoST2 multilingual speech-to-English-text translation task, covering 21 languages.

Joint speech-language training is challenging due to the large demand for training data and GPU consumption, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized data-efficiently for spoken language tasks. Particularly, we propose to incorporate cross-modality learning into transfer learning and conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech to English text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
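The abstract describes running cross-modality learning and transfer learning simultaneously in a multi-task manner. A minimal sketch of one common way to do this, a weighted sum of per-task losses, is shown below. The abstract does not state the loss formulation, so the function `combine_task_losses`, the task names, and the weights are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def combine_task_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return a weighted sum of per-task losses.

    Tasks without an explicit weight default to 1.0. This is a generic
    multi-task objective, not the formulation from the ComSL paper.
    """
    unknown = set(weights) - set(losses)
    if unknown:
        raise KeyError(f"weights given for tasks with no loss: {sorted(unknown)}")
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-step losses; task names are assumptions, not from the abstract.
step_losses = {
    "speech_translation": 2.0,
    "speech_recognition": 1.0,
    "text_translation": 0.5,
    "cross_modality": 0.25,
}
# Hypothetical weights: the downstream task is weighted fully, auxiliaries less or more.
step_weights = {
    "speech_translation": 1.0,
    "speech_recognition": 0.5,
    "text_translation": 0.5,
    "cross_modality": 2.0,
}

total = combine_task_losses(step_losses, step_weights)
print(total)
```

In a real training loop the per-task values would be differentiable tensors rather than floats, and the weights would typically be tuned on a validation set.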

Code Implementations: 1 repo