AS, CL | Dec 19, 2024

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, Zeeshan Ahmed, Frank Seide, Christian Fuegen

arXiv:2412.15415v13.32 citations | h-index: 23

Originality: Incremental advance

AI Analysis

The work addresses the need for efficient real-time speech processing in bilingual conversations. It is an incremental advance, building on existing transducer-based and multi-objective approaches.

The paper tackles simultaneous end-to-end automatic speech recognition and speech translation by proposing the JSTAR model, which is reported to outperform a strong cascaded model in both BLEU score and latency in a bilingual conversational setting with smart-glasses.

We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate superior performance of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.
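
As a rough illustration of what "optimizes both ASR and ST objectives simultaneously" can mean in practice, the sketch below combines two per-batch losses into a single training objective by linear interpolation. This is an assumption for illustration only: the abstract does not state how the two objectives are combined, and the function name `joint_loss` and the parameter `st_weight` are hypothetical, not taken from the paper.

```python
def joint_loss(asr_loss: float, st_loss: float, st_weight: float = 0.5) -> float:
    """Interpolate an ASR loss and an ST loss into one scalar objective.

    Hypothetical helper: the weighting scheme is an assumption, not the
    paper's stated method. st_weight = 0 trains on ASR only, 1 on ST only.
    """
    if not 0.0 <= st_weight <= 1.0:
        raise ValueError("st_weight must lie in [0, 1]")
    return (1.0 - st_weight) * asr_loss + st_weight * st_loss


if __name__ == "__main__":
    # Made-up loss values for one batch, weighting ST at 25%.
    print(joint_loss(2.0, 4.0, st_weight=0.25))  # 0.75 * 2.0 + 0.25 * 4.0 = 2.5
```

In a transducer setup each term would typically be a transducer loss computed on its own output branch; the scalar version above only shows the weighting step.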
