CL · Oct 15, 2025

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Ahmed Alzubaidi, Shaikha Alsuwaidi, Basma El Amel Boussaha, Leen AlQadi, Omar Alkaabi, Mohammed Alyafeai, Hamza Alobeidli, Hakim Hacid

arXiv:2510.13430v2 · 10.96 citations · h-index: 6

Originality: Synthesis-oriented

AI Analysis

The survey gives Arabic NLP researchers a comprehensive reference: it analyzes benchmark methodologies and offers recommendations for future development. As a survey, its contribution is incremental rather than novel.

This survey systematically reviews over 40 Arabic LLM benchmarks, categorizing them into knowledge, NLP tasks, culture/dialects, and target-specific evaluations, and identifies gaps such as limited temporal assessment and cultural misalignment in translated datasets.

This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches to benchmark construction (native collection, translation, and synthetic generation) and discuss their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.
