CL · AI · Jun 24, 2025

MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

arXiv:2506.19468v1 · 6 citations · h-index: 49
Originality: Incremental advance
AI Analysis

This work addresses the need for comprehensive, cross-lingually aligned multilingual assessment of LLMs. The contribution is incremental because it builds on existing benchmarking efforts.

The paper tackles fragmented multilingual evaluation of large language models by introducing MuBench, a benchmark covering 61 languages. It finds notable gaps between claimed and actual language coverage, including a persistent performance disparity between English and low-resource languages.

Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.

