CL | Dec 1, 2025

MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

arXiv:2512.01512v1 | 2 citations | h-index: 7 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses how to scale speech-to-text translation efficiently to many languages for users who need multilingual communication. It is an incremental advance, since it builds on existing MLLM methods.

The paper tackles two obstacles to many-to-many speech-to-text translation with MLLMs: limited language coverage and slow inference. Its MCAT framework scales to 70 languages and reduces each speech sequence to 30 tokens. It reports results that surpass state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions, using only ~100M trainable parameters and 10 hours of data per language.

Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at https://github.com/yxduir/m2m-70.
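To make the scale of these claims concrete, the short sketch below works through the arithmetic implied by the figures quoted in the abstract. It uses only numbers stated there. The constant and function names are illustrative, and the total-hours figure assumes that every one of the 70 languages contributes exactly 10 hours, which the abstract suggests but does not break down.

```python
# Back-of-the-envelope figures derived only from numbers quoted in the abstract.
# All names here are illustrative; none come from the MCAT code base.
NUM_LANGUAGES = 70
HOURS_PER_LANGUAGE = 10
BASELINE_SPEECH_TOKENS = 750  # "long sequences (e.g., 750 tokens)"
MCAT_SPEECH_TOKENS = 30       # length after the optimized speech adapter


def translation_directions(n: int) -> int:
    """Count ordered (source, target) pairs with source != target."""
    return n * (n - 1)


directions = translation_directions(NUM_LANGUAGES)
compression = BASELINE_SPEECH_TOKENS / MCAT_SPEECH_TOKENS
# Assumption: each language contributes exactly 10 hours of S2TT data.
total_hours = NUM_LANGUAGES * HOURS_PER_LANGUAGE

print(directions)   # 4830 translation directions (70x69)
print(compression)  # 25.0x shorter speech sequence
print(total_hours)  # 700 hours of S2TT data in total
```

Read this way, "70x69 directions" means 4,830 ordered language pairs, and the adapter shortens the speech input by a factor of 25 relative to the 750-token example.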

