CL · May 25, 2025

BnMMLU: Measuring Massive Multitask Language Understanding in Bengali

arXiv:2505.18951v19.65 citations · h-index: 1 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of evaluating language models in Bengali, a low-resource language, by providing a new benchmark. The contribution is incremental, since it adapts an existing framework to a new language.

The authors tackle the underrepresentation of low-resource languages like Bengali in language model evaluation by introducing BnMMLU, a benchmark with 138,949 question-option pairs across 23 domains, and report significant performance gaps in the models tested.

The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of language models in Bengali. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge, and is structured in a multiple-choice format to assess the factual knowledge, application-based problem-solving and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
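
As a rough illustration of the per-category analysis the abstract describes, the sketch below scores multiple-choice predictions separately for each cognitive category. It is not the authors' evaluation code: the function name `accuracy_by_category`, the field names `answer` and `category`, and the toy items are all assumptions made for this example, and the released dataset may use a different schema.

```python
from collections import defaultdict


def accuracy_by_category(items, predictions):
    """Return accuracy per cognitive category for multiple-choice items.

    `items` is a list of dicts with the hypothetical keys "answer"
    (the gold option label) and "category" (the cognitive category).
    `predictions` is a list of predicted option labels, one per item.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, predicted in zip(items, predictions):
        category = item["category"]
        total[category] += 1
        if predicted == item["answer"]:
            correct[category] += 1

    # Every category in `total` has at least one item, so no division by zero.
    return {category: correct[category] / total[category] for category in total}


# Toy data invented for illustration; these are not BnMMLU items or results.
items = [
    {"answer": "A", "category": "factual"},
    {"answer": "C", "category": "factual"},
    {"answer": "B", "category": "procedural"},
    {"answer": "D", "category": "reasoning"},
]
predictions = ["A", "B", "B", "A"]

scores = accuracy_by_category(items, predictions)
for category, accuracy in sorted(scores.items()):
    print(f"{category}: {accuracy:.2f}")
```

Reporting accuracy per category rather than a single overall number is what lets differences between factual recall, procedural application and reasoning show up in the results.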

