CL | Feb 12

Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su

arXiv:2602.11961v2 | 2 citations | h-index: 11 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the need for high-quality multilingual translation among users who require open-source alternatives. The advance is incremental, since it builds on existing LLM frameworks.

The paper tackles multilingual machine translation by scaling both model size and training data for open large language models. The resulting model, MiLMMT-46, achieves top-tier performance across 46 languages and outperforms recent state-of-the-art models.

Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro. Models are released at https://huggingface.co/collections/xiaomi-research/milmmt-46. Code is released at https://github.com/xiaomi-research/gemmax.
