CL · AI · Jun 5

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv:2606.07167 · 9.0 · Has Code
Originality: Incremental advance
AI Analysis

Provides the first native-sourced MMLU-style benchmark for Urdu, revealing that current LLMs have uneven knowledge, especially in culturally grounded content.

UrduMMLU introduces a 26,431-question benchmark for Urdu language understanding. Gemini-3.5-Flash reaches up to 90.34% accuracy, the strongest open-source model trails it by up to 8.92 points, and many models show large gaps on Urdu- and region-specific subjects.

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.
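
The two mechanical steps the abstract mentions, consensus filtering of dual annotations and per-domain accuracy comparison, can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes "strict consensus" means keeping only items on which both annotators chose the same answer, and the field names (`annotator_a`, `annotator_b`, `domain`), the helper names, and the toy records are all hypothetical rather than taken from UrduMMLU.

```python
from collections import defaultdict


def strict_consensus(items):
    """Keep only items whose two annotator labels agree exactly.

    Assumption: "strict consensus" is read here as exact agreement
    between the two annotators; disagreements are dropped, not resolved.
    """
    return [
        {**item, "label": item["annotator_a"]}
        for item in items
        if item["annotator_a"] == item["annotator_b"]
    ]


def accuracy_by_domain(items, predictions):
    """Return {domain: accuracy in percent} for predictions keyed by item id."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["domain"]] += 1
        if predictions.get(item["id"]) == item["label"]:
            correct[item["domain"]] += 1
    return {domain: 100.0 * correct[domain] / total[domain] for domain in total}


# Toy records for illustration only (not benchmark data).
items = [
    {"id": 1, "domain": "STEM", "annotator_a": "B", "annotator_b": "B"},
    {"id": 2, "domain": "STEM", "annotator_a": "A", "annotator_b": "C"},  # annotators disagree
    {"id": 3, "domain": "Humanities", "annotator_a": "D", "annotator_b": "D"},
    {"id": 4, "domain": "Humanities", "annotator_a": "A", "annotator_b": "A"},
]

kept = strict_consensus(items)

# Hypothetical model outputs, keyed by item id.
predictions = {1: "B", 3: "D", 4: "C"}

scores = accuracy_by_domain(kept, predictions)
gap = scores["STEM"] - scores["Humanities"]
```

Running the same scoring once per prompt language (English and Urdu) for each model would give the kind of paired accuracies the abstract reports, and `gap` corresponds to the STEM versus Humanities difference it describes.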
