The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks
This work addresses the need for equitable access to LLM benefits by enabling systematic evaluation across languages, particularly for low-resource languages, to foster transparency and inclusivity in multilingual AI.
The authors introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that assesses LLM performance across up to 200 languages, with a focus on low-resource languages. They also provide an open-source, auto-updating leaderboard and dashboard.
To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world's languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that support researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at https://huggingface.co/spaces/fair-forward/evals-for-every-language.
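To make the idea of aggregating task scores per language concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not state how scores are combined, so the unweighted mean, the record layout, the function names, and all score values below are assumptions chosen for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, language, task, score in [0, 1]).
# All values are invented for illustration; they are not benchmark results.
RESULTS = [
    ("model-a", "eng", "translation", 0.80),
    ("model-a", "eng", "question_answering", 0.90),
    ("model-a", "swa", "translation", 0.50),
    ("model-a", "swa", "question_answering", 0.40),
    ("model-b", "eng", "translation", 0.70),
    ("model-b", "swa", "translation", 0.50),
]


def language_proficiency(results):
    """Average the task scores for each (model, language) pair.

    Assumption: an unweighted mean over tasks; the real benchmark may
    weight or normalize tasks differently.
    """
    by_pair = defaultdict(list)
    for model, language, _task, score in results:
        by_pair[(model, language)].append(score)
    return {pair: mean(scores) for pair, scores in by_pair.items()}


def rank_models(results):
    """Rank models by the mean of their per-language scores, best first."""
    per_model = defaultdict(list)
    for (model, _language), score in language_proficiency(results).items():
        per_model[model].append(score)
    overall = {model: mean(scores) for model, scores in per_model.items()}
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(RESULTS):
        print(f"{model}: {score:.2f}")
```

Averaging per language first, and only then across languages, keeps a language with many evaluated tasks from dominating the overall score. That matters when low-resource languages are covered by fewer datasets.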