CL | AI | Jul 13, 2024

Bilingual Adaptation of Monolingual Foundation Models

arXiv:2407.12869v2 | 6 citations | h-index: 47
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of cross-lingual transfer for AI applications in non-English languages. The contribution is incremental because it builds on existing adaptation methods.

The paper tackles the problem of adapting monolingual large language models to new languages. It focuses on adapting Llama 2 to Arabic and reports significant improvements in Arabic and slight enhancements in English.

We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe. To demonstrate generalizability of this approach we also adapted Llama 3 8B to Arabic and Llama 2 13B to Hindi.
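The two-stage recipe described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not say which embedding initialization, data mix ratio, or parameter names were used, so the mean-of-subtokens initialization, the `arabic_fraction` value, and every function and parameter name below are hypothetical choices made for illustration only.

```python
import random


def expand_embeddings(embeddings, new_token_pieces):
    """Append one embedding row per newly added vocabulary token.

    ASSUMPTION: each new row is the mean of the rows of the existing
    sub-tokens the new token was previously split into. The abstract only
    says that initialization techniques were ablated, not which one won.
    """
    dim = len(embeddings[0])
    expanded = [row[:] for row in embeddings]  # copy; leave the input untouched
    for pieces in new_token_pieces:
        rows = [embeddings[i] for i in pieces]
        expanded.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return expanded


def trainable_parameters(stage, parameter_names):
    """Stage 1 updates only embedding parameters; stage 2 updates all of them.

    ASSUMPTION: embedding parameters are identified by the substring "embed",
    a hypothetical naming convention.
    """
    if stage == 1:
        return [name for name in parameter_names if "embed" in name]
    return list(parameter_names)


def sample_bilingual_batch(arabic_docs, english_docs, arabic_fraction, batch_size, rng):
    """Draw a batch in which each item is Arabic with probability arabic_fraction."""
    batch = []
    for _ in range(batch_size):
        pool = arabic_docs if rng.random() < arabic_fraction else english_docs
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    # Toy 3-token vocabulary with 2-dimensional embeddings.
    old = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Two new tokens: one formerly split into tokens 0 and 2, one into token 1.
    new = expand_embeddings(old, [[0, 2], [1]])
    print(len(new), new[3], new[4])

    names = ["embed_tokens.weight", "layers.0.attn.weight", "layers.0.mlp.weight"]
    print(trainable_parameters(1, names))
    print(trainable_parameters(2, names))

    rng = random.Random(0)
    # arabic_fraction=0.5 is a placeholder, not a ratio reported in the paper.
    print(sample_bilingual_batch(["ar_doc"], ["en_doc"], 0.5, 4, rng))
```

In a real training setup, the stage 1 filter would correspond to freezing every weight except the embedding matrix. Stage 2 would unfreeze the whole model and train on batches that mix both languages, which is how the abstract describes retaining English proficiency while acquiring Arabic.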

