End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs
This work addresses speech translation for multilingual communication, presenting an incremental improvement by combining existing foundational models.
The paper tackles speech translation by integrating pre-trained speech encoders and large language models into an end-to-end architecture that performs automatic speech recognition and speech translation simultaneously. For English-to-German translation, it reports a gain of up to 8% in COMET-DA22 score compared to SeamlessM4T, and it matches a cascaded system built from Whisper and NLLB.
Speech Translation (ST) is a machine translation task that converts speech signals in one language into the corresponding text in another language. There are two main approaches to this task: the traditional cascade and the more recent end-to-end approach. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) that performs both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only achieves better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but also matches the performance of a cascaded system with Whisper and NLLB, with a gain of up to 8% in the $\text{COMET}^{\text{DA}}_{22}$ score.
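The difference between the two approaches can be illustrated with a minimal sketch. It is not the paper's implementation: the names `cascade`, `end_to_end`, and `SpeechOutput` are hypothetical, and every model is stood in for by a plain callable. In particular, how the encoder output reaches the LLM and how the LLM emits both texts are assumptions made only to keep the sketch small.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = Sequence[float]   # placeholder type for waveform samples
Features = List[float]    # placeholder type for speech-encoder output


@dataclass
class SpeechOutput:
    transcript: str   # ASR result in the source language
    translation: str  # ST result in the target language


def cascade(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
) -> Callable[[Audio], SpeechOutput]:
    """Cascade: a text-only MT model translates the ASR transcript,
    so any recognition error is passed on to the translation step."""
    def run(audio: Audio) -> SpeechOutput:
        transcript = asr(audio)
        return SpeechOutput(transcript, mt(transcript))
    return run


def end_to_end(
    encoder: Callable[[Audio], Features],
    llm: Callable[[Features], SpeechOutput],
) -> Callable[[Audio], SpeechOutput]:
    """End-to-end (assumed shape): a speech encoder feeds an LLM that
    produces the transcript and the translation in a single pass."""
    def run(audio: Audio) -> SpeechOutput:
        return llm(encoder(audio))
    return run
```

In the cascade, the translation depends only on the transcript text, whereas in the end-to-end variant the LLM conditions on the speech features directly. In the systems named above, the `asr` and `mt` stand-ins would correspond to Whisper and NLLB respectively.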