CL | Jun 25, 2025

Multi-lingual Functional Evaluation for Large Language Models

Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian

arXiv:2506.20793v1 | 4.91 citations | h-index: 21

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better assessment of practical performance in multi-lingual NLP, but it is incremental: it builds on existing benchmarks by translating them.

The paper tackles the problem of inadequate evaluation of multi-lingual competence in large language models by creating functional benchmarks. It finds that some static benchmarks track functional performance closely while others do not; for example, across models, performance in English drops by 24% from M-GSM to CL-GSM Symbolic.
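To make a figure like the 24% drop concrete, here is a minimal sketch of how a gap between a static and a functional benchmark score could be computed. The summary does not say whether the reported drops are absolute (percentage points) or relative, so the sketch returns both. The function name and the accuracy values are illustrative assumptions, not taken from the paper.

```python
def performance_drop(static_acc: float, functional_acc: float) -> dict:
    """Compare a static-benchmark score with a functional-benchmark score.

    Both inputs are accuracies in percent (0-100). Returns the absolute
    drop in percentage points and the relative drop in percent.
    """
    if static_acc <= 0:
        raise ValueError("static_acc must be positive")
    absolute = static_acc - functional_acc
    relative = 100.0 * absolute / static_acc
    return {"absolute_pp": absolute, "relative_pct": relative}


# Hypothetical accuracies for one model and one language (NOT from the paper).
drop = performance_drop(static_acc=80.0, functional_acc=60.0)
print(drop)
```

With these made-up inputs the absolute drop is 20 percentage points and the relative drop is 25%, which shows why it matters which of the two a reported figure refers to.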

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval) -- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others: across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish, respectively; there is a 15-24% performance drop across languages between Belebele and CL-IFEval; and there is only a 0.5-3% performance drop between M-MMLU and CL-IFEval. We also find that model robustness across languages varies significantly, with certain languages (e.g., Arabic and English) performing most consistently well across evaluation iterations.
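The idea of a translated functional benchmark template can be sketched as follows. A functional benchmark generates fresh problem instances from a template instead of reusing a fixed question set, and translating the template lets the same instance be produced in several languages. The template texts, the `TEMPLATES` dictionary, the `instantiate` function and the value ranges below are hypothetical illustrations and are not the paper's actual templates or code.

```python
import random

# Hypothetical parallel templates for one grade-school math problem.
# These are illustrative only and are NOT taken from CL-GSM Symbolic.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. "
          "How many apples does {name} have now?",
    "fr": "{name} a {a} pommes et en achète {b} de plus. "
          "Combien de pommes {name} a-t-il maintenant ?",
    "es": "{name} tiene {a} manzanas y compra {b} más. "
          "¿Cuántas manzanas tiene {name} ahora?",
}


def instantiate(lang: str, seed: int) -> tuple[str, int]:
    """Build one problem instance and its gold answer.

    The seed fixes the sampled values, so the same seed yields the same
    underlying problem in every language.
    """
    rng = random.Random(seed)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    question = TEMPLATES[lang].format(name="Ade", a=a, b=b)
    return question, a + b


# Generate the same instance in each language.
for lang in TEMPLATES:
    question, answer = instantiate(lang, seed=0)
    print(lang, "|", question, "|", answer)
```

Because each seed gives a new set of values, a model can be scored over many instances of the same template, which is one way to probe robustness rather than performance on a single fixed question.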
