CL | Sep 26, 2024

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

arXiv:2409.17912v2 | 23 citations | h-index: 57
Originality: Incremental advance
AI Analysis

This work addresses the neglect of low-resource languages such as Moroccan Arabic in contemporary LLMs and offers design methodologies for instruction-tuning. It is incremental in that it adapts existing methods to a specific domain.

The authors adapt large language models to the low-resource Moroccan Arabic dialect by introducing Atlas-Chat, a collection of models fine-tuned on a purpose-built instruction dataset. The resulting models outperform state-of-the-art and Arabic-specialized LLMs; for example, the 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU.

We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.
