EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
This work addresses the challenge of multilingual adaptation for underrepresented languages, though it is incremental in that it builds on existing models and methods.
The authors address the limited coverage of low-resource languages in large language models by introducing EMMA-500, a model continually pre-trained on texts across 546 languages, which demonstrates significant gains in cross-lingual transfer, task generalization, and language adaptability.
In this work, we introduce EMMA-500, a large-scale multilingual language model continually pre-trained on texts across 546 languages and designed for enhanced multilingual performance, with a focus on improving language coverage for low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability. We release the MaLA corpus, EMMA-500 model weights, scripts, and model generations.