CL · Jul 1, 2024

Cross-Lingual Transfer Learning for Speech Translation

arXiv:2407.01130v3 · 21 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of building efficient speech translation systems for multiple languages with limited data. The advance is incremental, as it builds on existing models such as Whisper.

The paper tackles the problem of extending the speech translation capabilities of multilingual foundation models with limited data, using Whisper as the example. It shows that fine-tuning on English-to-Chinese data alone improves translation into Chinese for multiple source languages, achieving zero-shot cross-lingual transfer.

There has been increasing interest in building multilingual foundation models for NLP and speech research. This paper examines how to expand the speech translation capability of these models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space. This shared embedding space can then be leveraged for zero-shot cross-lingual transfer in speech translation. By fine-tuning the Whisper decoder with only English-to-Chinese speech translation data, improved performance for translation to Chinese can be obtained for multiple languages, in addition to English. Furthermore, for languages related to those seen in training it is possible to perform speech translation, despite the model never seeing the language in training, or being able to perform transcription.
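
The speech-to-speech retrieval analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each utterance is represented by a list of encoder frame vectors, that frames are mean-pooled into one utterance vector, and that candidates are ranked by cosine similarity. The abstract does not specify the pooling or similarity measure, so both are assumptions here, and the toy vectors are hypothetical stand-ins for real Whisper encoder outputs.

```python
import math

def mean_pool(frames):
    """Average a list of frame vectors into a single utterance vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(queries, candidates):
    """For each query utterance, return the index of the most similar candidate."""
    q_vecs = [mean_pool(q) for q in queries]
    c_vecs = [mean_pool(c) for c in candidates]
    return [max(range(len(c_vecs)), key=lambda j: cosine(q, c_vecs[j]))
            for q in q_vecs]

def recall_at_1(queries, candidates):
    """Fraction of queries whose top match is the parallel utterance (same index)."""
    matches = retrieve(queries, candidates)
    return sum(1 for i, j in enumerate(matches) if i == j) / len(queries)

# Hypothetical toy data: two parallel utterance pairs, 2-dimensional frames.
# Real encoder outputs would have many more frames and dimensions.
english = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0], [0.1, 0.9]]]
french = [[[0.9, 0.1], [0.9, 0.1], [1.0, 0.0]], [[0.2, 0.8]]]

top_matches = retrieve(english, french)
score = recall_at_1(english, french)
```

If utterances with the same meaning in different languages land close together in the encoder space, recall of this kind should be high for parallel data, which is the sense in which the abstract describes a shared semantic space.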

