CL | AI | May 20, 2025

GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data

arXiv:2505.17082v1 | 2 citations | h-index: 8 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses language inclusivity for Moroccan Arabic speakers by providing a low-resource, efficient method that improves LLM performance in Darija without sacrificing general capabilities.

The paper tackles the marginalization of Moroccan Arabic (Darija) in open-source LLMs by developing GemMaroc, a model family trained on a small, curated instruction set that improves Darija proficiency while preserving reasoning skills. The 4B model lifts DarijaMMLU from 32.8% to 47.5%, and the 27B model matches Atlas-Chat on DarijaMMLU and outperforms it on Darija HellaSwag, all with minimal compute.

Open-source large language models (LLMs) still marginalise Moroccan Arabic (Darija), forcing practitioners either to bolt on heavyweight Arabic adapters or to sacrifice the very reasoning skills that make LLMs useful. We show that a rigorously quality-over-quantity alignment strategy can surface fluent Darija while safeguarding the backbone's cross-lingual reasoning at a sliver of the usual compute. We translate three compact instruction suites (LIMA 1K, DEITA 6K and TULU 50K) into Darija, preserve 20% of the English originals, and add mathematics, coding and scientific prompts. A LoRA-tuned Gemma 3-4B trained on 5K mixed instructions lifts DarijaMMLU from 32.8% to 42.7%; adding the reasoning-dense TULU portion pushes it to 47.5% with no English regression. Scaling the identical recipe to Gemma 3-27B produces GemMaroc-27B, which matches Atlas-Chat on DarijaMMLU (61.6%) and leaps ahead on Darija commonsense, scoring 60.5% on HellaSwag versus Atlas-Chat's 48.4%. Crucially, GemMaroc retains Gemma-27B's strong maths and general-reasoning ability, showing only minimal movement on GSM8K and English benchmarks. The entire model is trained in just 48 GPU·h, underscoring a Green AI pathway to inclusive, sustainable language technology. We release code, data and checkpoints to spur Darija-centric applications in education, public services and everyday digital interaction.
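
The data-mixing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code. It assumes that "preserve 20% of the English originals" means a random 20% of instructions stay in English while the rest are used in their Darija translation, which is one possible reading. The function name `build_mixture`, the `en`/`darija` record schema, and the toy data are all hypothetical.

```python
import random


def build_mixture(pairs, english_fraction=0.2, seed=0):
    """Build a mixed-language instruction set from parallel examples.

    pairs: list of dicts with hypothetical keys "en" (English original)
           and "darija" (its Darija translation).
    english_fraction: share of examples kept in English (assumed to be 0.2,
           following the 20% figure in the abstract).
    """
    rng = random.Random(seed)
    n_english = round(len(pairs) * english_fraction)
    # Pick which examples stay in English; all others use the translation.
    english_ids = set(rng.sample(range(len(pairs)), n_english))

    mixture = []
    for i, pair in enumerate(pairs):
        if i in english_ids:
            mixture.append({"lang": "en", "text": pair["en"]})
        else:
            mixture.append({"lang": "darija", "text": pair["darija"]})

    # Shuffle so the two languages are interleaved during training.
    rng.shuffle(mixture)
    return mixture


# Toy parallel data standing in for a translated instruction suite.
pairs = [
    {"en": f"instruction {i}", "darija": f"<darija translation {i}>"}
    for i in range(10)
]
mixture = build_mixture(pairs)
english_count = sum(1 for m in mixture if m["lang"] == "en")
print(len(mixture), english_count)
```

A fixed seed keeps the English subset reproducible across runs. The mathematics, coding and scientific prompts mentioned in the abstract would be appended to the resulting list in the same record format.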

