CL · Sep 29, 2024

Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning

Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin

arXiv:2409.19510v2

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of parallel data for many-to-many translation and enables effective learning in low-resource settings. It is an incremental advance because it builds on existing multimodal LLM capabilities.

The paper tackles the problem of limited parallel data for many-to-many speech-to-text translation with a curriculum learning strategy. It achieves state-of-the-art average performance across 15×14 language pairs and reaches competitive results with fewer than 10 hours of speech data per language.

Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in 15×14 language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at https://github.com/yxduir/LLM-SRT.
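The "15×14" figure counts ordered translation directions. Each of the 15 languages acts as a source and is translated into each of the other 14, which gives 210 directions in total. The sketch below enumerates them. The language codes are hypothetical placeholders, because this summary does not list the actual 15 languages, and `many_to_many_directions` is an illustrative helper, not part of the released code.

```python
from itertools import permutations

# Hypothetical placeholder codes; the paper's actual 15 languages are not listed here.
LANGUAGES = [f"lang{i:02d}" for i in range(15)]


def many_to_many_directions(languages):
    """Return every ordered (source, target) pair where source != target."""
    # permutations(..., 2) yields ordered pairs of distinct positions, so
    # (a, b) and (b, a) are counted as separate translation directions.
    return list(permutations(languages, 2))


directions = many_to_many_directions(LANGUAGES)
print(len(directions))  # 210, i.e. 15 sources x 14 targets each
```

Direction matters here: translating A into B and B into A are evaluated separately, which is why the count is 15×14 rather than the 105 unordered pairs.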

