CL · AI · Jun 5

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov

arXiv:2606.07167 · 9.0 · Has Code

Originality: Incremental advance

AI Analysis

Provides the first native-sourced MMLU-style benchmark for Urdu, revealing that current LLMs have uneven knowledge, especially in culturally grounded content.

UrduMMLU introduces a 26,431-question benchmark for Urdu language understanding. Gemini-3.5-Flash leads with up to 90.34% accuracy, the strongest open-source model trails it by 7.79 and 8.92 points across the two prompt languages, and many models score 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM.

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.
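The domain-level comparison reported above (accuracy per domain, then the STEM versus Humanities difference in points) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy_by_domain`, the `(domain, gold, predicted)` record layout, and the toy records are all assumptions made for the example, and the numbers it produces are not UrduMMLU results.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return accuracy (in percent) per domain.

    `records` is an iterable of (domain, gold_answer, predicted_answer)
    tuples. This layout is hypothetical, chosen only for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, gold, predicted in records:
        total[domain] += 1
        correct[domain] += int(gold == predicted)
    return {d: 100.0 * correct[d] / total[d] for d in total}


# Made-up toy records, NOT drawn from UrduMMLU.
toy_records = [
    ("STEM", "A", "A"),
    ("STEM", "B", "B"),
    ("STEM", "C", "C"),
    ("STEM", "D", "A"),
    ("Humanities", "A", "B"),
    ("Humanities", "B", "B"),
    ("Humanities", "C", "D"),
    ("Humanities", "D", "D"),
]

scores = accuracy_by_domain(toy_records)
# Gap in percentage points between the two domains.
gap = scores["STEM"] - scores["Humanities"]
print(scores, gap)
```

Reporting the gap in percentage points, as the abstract does, keeps it comparable across models with different overall accuracy.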
