CL · May 18

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$Δ$ Integration into Upcycled MoE

Hao Zhou, Tianhao Li, Zhijun Wang, Shuaijie She, Linjuan Wu, Hao-Ran Wei, Baosong Yang, Jiajun Chen, Shujian Huang

arXiv:2605.18083 · 89.2

Predicted impact: top 35% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners aiming to multilingualize LLMs efficiently, this method offers a data-efficient alternative to extensive continued pre-training and alignment, though it is incremental over existing MoE and delta-merging techniques.

The paper tackles the problem of expanding LLMs to new languages while preserving their original capabilities. It introduces a method that upcycles a dense model into an MoE architecture and grafts a post-training parameter delta onto the CPT-enhanced base model. This improves performance on the new languages without costly alignment and outperforms baselines with similar FLOPs or parameter counts.

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. Recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, but they face a critical trade-off: mitigating parameter conflicts to preserve original abilities dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce a method that upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting an MoE-expanded parameter delta ($\Delta_{\text{post}}$) onto the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate the method's superiority even against baselines with similar FLOPs or parameter counts; it improves performance on the expanded languages while effectively preserving original capabilities. We further show that our approach applies across different models and post-training deltas.
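The grafting step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that $\Delta_{\text{post}}$ is the element-wise difference between the instruct and base weights, and that "MoE-expanded" means the delta of each feed-forward tensor is copied to every expert; the abstract confirms neither detail. Router weights are ignored, and all names (`param_delta`, `upcycle`, `graft`, the `ffn` prefix) are hypothetical.

```python
# Toy sketch under stated assumptions; parameters are dicts of float lists.

def param_delta(instruct, base):
    """Assumed definition: Delta_post = theta_instruct - theta_base."""
    return {name: [i - b for i, b in zip(instruct[name], base[name])]
            for name in base}

def upcycle(dense, num_experts, ffn_prefix="ffn"):
    """Copy each dense FFN tensor into every expert slot; keep other tensors shared."""
    moe = {}
    for name, values in dense.items():
        if name.startswith(ffn_prefix):
            for e in range(num_experts):
                moe[f"expert{e}.{name}"] = list(values)
        else:
            moe[name] = list(values)
    return moe

def graft(moe_params, moe_delta):
    """Add the MoE-expanded delta to the CPT-enhanced MoE weights."""
    return {name: [p + d for p, d in zip(moe_params[name], moe_delta[name])]
            for name in moe_params}

# Dense base model and its instruct (post-trained) counterpart.
base = {"attn": [1.0, 2.0], "ffn": [0.5, 0.5]}
instruct = {"attn": [1.5, 2.0], "ffn": [0.5, 1.0]}

# Upcycle the base into 2 experts, then stand in for CPT by
# changing expert 1 (the hypothetical new-language expert).
cpt_moe = upcycle(base, num_experts=2)
cpt_moe["expert1.ffn"] = [0.75, 0.25]

# Expand the dense delta the same way and graft it on.
moe_delta = upcycle(param_delta(instruct, base), num_experts=2)
aligned = graft(cpt_moe, moe_delta)
```

In this toy setup, expert 0 ends up identical to the instruct model's feed-forward weights, while expert 1 keeps its CPT changes and receives the same post-training shift on top.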
