CL · AI · LG · Sep 17, 2025

Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

arXiv:2509.14008v1 · 4 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the need for specialized Arabic NLP models, a domain-specific problem for researchers and practitioners in Arabic language processing. It is an incremental advance, since it builds on existing translation and fine-tuning techniques.

The authors address the problem of building high-quality Arabic instruction and translation models by developing Hala, a family of Arabic-centric models trained with a translate-and-tune pipeline. The pipeline compresses an Arabic-English teacher model to FP8 (about 2× higher throughput with no quality loss), uses the teacher's output to fine-tune a lightweight translator, and then translates English instruction sets into a million-scale Arabic instruction corpus. The resulting models achieve state-of-the-art performance on Arabic-centric benchmarks in both the nano (≤2B) and small (7-9B) parameter categories.
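
To make the FP8 step more concrete, here is a minimal sketch of what rounding weights to an 8-bit float grid looks like. It only simulates the numerics in NumPy and is not the authors' compression method. The E4M3 layout (3 mantissa bits, maximum 448), the per-tensor scale, and the names `round_to_e4m3` and `fake_quantize_fp8` are all assumptions made for illustration; the summary does not say which FP8 variant or scaling scheme was used.

```python
import numpy as np

# Assumed format: FP8 E4M3 (1 sign, 4 exponent, 3 mantissa bits).
E4M3_MAX = 448.0          # largest finite E4M3 magnitude
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round values to the nearest number on the (assumed) E4M3 grid."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # frexp gives x = m * 2**exp with 0.5 <= |m| < 1, so floor(log2|x|) = exp - 1.
    _, exp = np.frexp(x)
    # Below the smallest normal exponent the grid spacing stays fixed (subnormals).
    e = np.maximum(exp - 1, E4M3_MIN_NORMAL_EXP)
    step = np.ldexp(1.0, e - MANTISSA_BITS)  # spacing between neighbouring values
    return np.round(x / step) * step


def fake_quantize_fp8(w):
    """Scale a tensor into the E4M3 range, round it, and scale back.

    Per-tensor scaling is an illustrative assumption, not taken from the paper.
    """
    w = np.asarray(w, dtype=np.float64)
    amax = np.max(np.abs(w))
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return round_to_e4m3(w / scale) * scale, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(4, 4))
    dequantized, scale = fake_quantize_fp8(weights)
    print("max abs rounding error:", np.max(np.abs(weights - dequantized)))
```

Each stored weight takes 8 bits instead of the 16 bits of a half-precision format, which is the kind of saving that can translate into higher inference throughput on hardware with FP8 support.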

We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR$\leftrightarrow$EN teacher to FP8 (yielding $\sim$2$\times$ higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" ($\leq$2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
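
The slerp merging mentioned in the abstract can be illustrated with a short sketch. Spherical linear interpolation blends two weight tensors along the arc between them rather than along a straight line. The code below is a generic NumPy version under stated assumptions: tensors are flattened and merged one at a time, the interpolation factor `t` is a free parameter, and the function names and the near-parallel fallback to linear interpolation are illustrative choices. The abstract does not give the authors' merge settings.

```python
import numpy as np


def slerp(w_base, w_tuned, t, eps=1e-8):
    """Spherical linear interpolation between two same-shaped weight tensors.

    t = 0 returns w_base, t = 1 returns w_tuned. (Illustrative sketch.)
    """
    a = np.asarray(w_base, dtype=np.float64).ravel()
    b = np.asarray(w_tuned, dtype=np.float64).ravel()
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    if norm_a < eps or norm_b < eps:
        # A zero tensor has no direction; fall back to plain interpolation.
        merged = (1.0 - t) * a + t * b
        return merged.reshape(np.shape(w_base))
    # Angle between the two tensors, viewed as vectors.
    cos_omega = np.clip(np.dot(a / norm_a, b / norm_b), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel tensors: slerp degenerates to linear interpolation.
        merged = (1.0 - t) * a + t * b
    else:
        merged = (np.sin((1.0 - t) * omega) / sin_omega) * a \
               + (np.sin(t * omega) / sin_omega) * b
    return merged.reshape(np.shape(w_base))


def merge_state_dicts(base, tuned, t=0.5):
    """Apply slerp tensor by tensor to two dicts with matching keys.

    The default t = 0.5 is an arbitrary illustrative value.
    """
    return {name: slerp(base[name], tuned[name], t) for name in base}


if __name__ == "__main__":
    base = {"layer.weight": np.array([[1.0, 0.0], [0.0, 1.0]])}
    tuned = {"layer.weight": np.array([[0.0, 1.0], [1.0, 0.0]])}
    print(merge_state_dicts(base, tuned, t=0.5)["layer.weight"])
```

In this framing, a value of `t` closer to 0 keeps more of the base model and a value closer to 1 keeps more of the Arabic-specialized fine-tune, which is the balance the abstract describes.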

