CL · AI · LG · Apr 8, 2024

SambaLingo: Teaching Large Language Models New Languages

arXiv:2404.05829v2 · 24 citations · h-index: 17 · MRL
Originality: Incremental advance
AI Analysis

This work addresses the gap in LLM capabilities across diverse languages, which matters for making AI accessible globally. It is an incremental advance because it builds on existing adaptation methods.

The paper investigates best practices for adapting large language models to new languages, covering vocabulary extension, direct preference optimization, and data scarcity for human alignment in low-resource languages. Across 9 languages and 2 parameter scales (7B and 70B), its models outperform all prior published baselines.

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
