Multilingual Simultaneous Speech Translation
This work addresses the need for efficient real-time translation in multilingual conferences or meetings, though the contribution is incremental, since it builds on an existing adaptation technique.
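The paper tackles the problem of balancing translation quality and latency in simultaneous speech translation for multilingual settings. It shows that a technique originally developed to adapt end-to-end monolingual models to online use carries over to multilingual models and to both end-to-end and cascade architectures: latency falls by a similar amount (40% relative) across languages and architectures, with the end-to-end systems losing less translation quality than the cascade systems.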
Applications designed for simultaneous speech translation during events such as conferences or meetings need to balance quality and lag while displaying translated text to deliver a good user experience. One common approach to building online spoken language translation systems is to leverage models built for offline speech translation. Based on a technique to adapt end-to-end monolingual models, we investigate how well multilingual models and different architectures (end-to-end and cascade) perform online speech translation. On the multilingual TEDx corpus, we show that the approach generalizes to different architectures. We see similar gains in latency reduction (40% relative) across languages and architectures. However, the end-to-end architecture leads to smaller translation quality losses after the model is adapted for online use. Furthermore, the approach even scales to zero-shot directions.
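To make the "40% relative" figure concrete, the sketch below shows how a relative latency reduction is computed from a baseline lag and an adapted lag. The function name and the lag values are hypothetical, chosen only to illustrate the arithmetic. They are not measurements from the paper, and the paper's own latency metric is not specified here.

```python
def relative_reduction(baseline_lag: float, adapted_lag: float) -> float:
    """Return the fraction by which lag shrinks relative to the baseline."""
    if baseline_lag <= 0:
        raise ValueError("baseline lag must be positive")
    return (baseline_lag - adapted_lag) / baseline_lag


# Hypothetical lag values in seconds (illustrative only, not from the paper).
baseline_lag = 5.0  # assumed lag of the offline model used as-is
adapted_lag = 3.0   # assumed lag after adaptation for online use

print(f"{relative_reduction(baseline_lag, adapted_lag):.0%}")
```

A relative figure of this kind says nothing about the absolute lag: the same 40% could describe a drop from 5 s to 3 s or from 2.5 s to 1.5 s, so it should be read alongside the accompanying quality scores.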