CL | AI | Jul 8, 2024

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

CMU
arXiv:2407.05975v2 | 69 citations | h-index: 60 | Has Code
AI Analysis

This work addresses translation capabilities for over 100 languages, including low-resource ones, but is incremental as it builds on existing LLaMA models with enhanced training strategies.

The paper tackles the problem of low-resource language translation in LLMs by conducting multilingual continual pre-training on LLaMA models, resulting in LLaMAX, which achieves over 10 spBLEU points higher translation performance than existing open-source LLMs and matches specialized models on the Flores-101 benchmark.

Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we conduct extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more than 10 spBLEU points) and performs on par with a specialized translation model (M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and the models (https://huggingface.co/LLaMAX/) are publicly available.

Code Implementations: 1 repo
