CL, LG · Mar 18, 2025

Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM

Yazeed Alnumay, Alexandre Barbet, Anna Bialas, William Darling, Shaan Desai, Joan Devassy, Kyle Duffy, Stephanie Howe, Olivia Lasche, Justin Lee, Anirudh Shrinivason, Jennifer Tracey

arXiv:2503.14603v1 · 13.913 citations · h-index: 5 · Proceedings of the Sixth Workshop on African Natural Language Processing (AfricaNLP 2025)

Originality: Incremental advance

AI Analysis

The paper addresses the limited availability of digitized Arabic data for enterprise applications. The advance is incremental, as it builds on existing multilingual LLM approaches.

The paper tackles the challenge of building high-quality Arabic LLMs for enterprise use by developing a data synthesis and refinement strategy. The result is a 7B open-weight model that outperforms similarly sized peers on Arabic benchmarks.

Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
