CL Nov 29, 2024

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin

arXiv:2411.19799v122.366 citations h-index: 25 ICLR

Originality: Synthesis-oriented

AI Analysis

This work addresses the evaluation bottleneck in developing effective multilingual LLMs for diverse communities, though its contribution is incremental: it focuses on benchmark creation rather than model improvement.

The authors tackled the lack of high-quality evaluation resources for multilingual language models by constructing INCLUDE, a benchmark of 197,243 QA pairs from local exam sources across 44 languages, which measures performance in regional contexts.

The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.
