CL | AI | Jul 22, 2024

ALLaM: Large Language Models for Arabic and English

arXiv:2407.15390v1 | 72 citations | h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses the need for high-quality Arabic language models and benefits the Arabic Language Technologies (ALT) ecosystem. It is an incremental advance, since it builds on existing multilingual training methods.

The authors tackle the problem of developing large language models for Arabic while maintaining English proficiency, and report state-of-the-art performance on Arabic benchmarks such as MMLU Arabic and ACVA.

We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.

