CL AI · Jul 8, 2024

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan

CMU

arXiv:2407.05975v2 · 22,970 citations · h-index: 60 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses translation capabilities for over 100 languages, including low-resource ones, but is incremental as it builds on existing LLaMA models with enhanced training strategies.

The paper tackles the problem of low-resource language translation in LLMs by conducting multilingual continual pre-training on LLaMA models, resulting in LLaMAX, which achieves over 10 spBLEU points higher translation performance than existing open-source LLMs and matches specialized models on the Flores-101 benchmark.

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance than existing open-source LLMs (by more than 10 spBLEU points) and performs on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and the models (https://huggingface.co/LLaMAX/) are publicly available.
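The abstract reports gains in spBLEU, which is BLEU computed over SentencePiece subword tokens rather than words. The sketch below illustrates only the BLEU arithmetic (clipped n-gram precision, geometric mean, brevity penalty) in plain Python. It is not the paper's evaluation code: the function names are hypothetical, the `tokenize` argument is a stand-in (whitespace splitting here, where a real spBLEU score would use a SentencePiece model's encoder), and no smoothing is applied, so scores will differ from standard tooling.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps, refs, tokenize, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100), one reference per hypothesis.

    `tokenize` is an assumption of this sketch: any callable mapping a
    string to a list of tokens. For spBLEU it would be a SentencePiece
    encoder; a whitespace split is used below purely for illustration.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = tokenize(hyp), tokenize(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            # Counter intersection keeps the minimum count per n-gram,
            # which is the "clipped" match count used by BLEU.
            matches[n - 1] += sum((h_ng & r_ng).values())
            totals[n - 1] += sum(h_ng.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on a mat"]
    refs = ["the cat sat on the mat"]
    print(round(corpus_bleu(hyps, refs, str.split), 2))
```

Passing a different `tokenize` callable is the only change needed to move from word-level BLEU to a subword-level score, which is why the tokenizer is kept as a parameter rather than hard-coded.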
