CLFeb 28, 2025Code
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMsFakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy et al.
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
CLMay 28, 2025
Pearl: A Multimodal Culturally-Aware Arabic Instruction DatasetFakhraddin Alwajih, Samar M. Magdy, Abdellah El Mekki et al.
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
CLJan 19
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMsAbdellah El Mekki, Samar M. Magdy, Houdaifa Atou et al.
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce \textbf{Alexandria}, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.
CLMay 15, 2015
Arabic Inquiry-Answer Dialogue Acts Annotation SchemaAbdelRahim A. Elmadany, Sherif M. Abdou, Mervat Gheith
We present an annotation schema as part of an effort to create a manually annotated corpus for Arabic dialogue language understanding including spoken dialogue and written "chat" dialogue for inquiry-answer domain. The proposed schema handles mainly the request and response acts that occurs frequently in inquiry-answer debate conversations expressing request services, suggests, and offers. We applied the proposed schema on 83 Arabic inquiry-answer dialogues.
CLMay 12, 2015
A Survey of Arabic Dialogues Understanding for Spontaneous Dialogues and Instant MessageAbdelRahim A. Elmadany, Sherif M. Abdou, Mervat Gheith
Building dialogues systems interaction has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need for designing systems for other languages is increasing such as Arabic language. For this reasons, there are more interest for Arabic dialogue acts classification task because it a key player in Arabic language understanding to building this systems. This paper surveys different techniques for dialogue acts classification for Arabic. We describe the main existing techniques for utterances segmentations and classification, annotation schemas, and test corpora for Arabic dialogues understanding that have introduced in the literature
CLMay 12, 2015
Turn Segmentation into Utterances for Arabic Spontaneous Dialogues and Instance MessagesAbdelRahim A. Elmadany, Sherif M. Abdou, Mervat Gheith
Text segmentation task is an essential processing task for many of Natural Language Processing (NLP) such as text summarization, text translation, dialogue language understanding, among others. Turns segmentation considered the key player in dialogue understanding task for building automatic Human-Computer systems. In this paper, we introduce a novel approach to turn segmentation into utterances for Egyptian spontaneous dialogues and Instance Messages (IM) using Machine Learning (ML) approach as a part of automatic understanding Egyptian spontaneous dialogues and IM task. Due to the lack of Egyptian dialect dialogue corpus the system evaluated by our corpus includes 3001 turns, which are collected, segmented, and annotated manually from Egyptian call-centers. The system achieves F1 scores of 90.74% and accuracy of 95.98%.