CL · Jun 25, 2025

Multi-lingual Functional Evaluation for Large Language Models

arXiv:2506.20793v1 · 1 citation · h-index: 21
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need for better assessment of practical performance in multi-lingual NLP, but the contribution is incremental: it builds on existing benchmarks by translating them.

The paper addresses the inadequate evaluation of multi-lingual competence in large language models by creating functional benchmarks. It finds that some static benchmarks track functional performance closely while others do not; for example, performance in English drops by 24% from M-GSM to CL-GSM Symbolic.

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create two multi-lingual functional benchmarks, Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval), by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others: across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; there is a 15-24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval. We also find that model robustness across languages varies significantly, with certain languages (e.g., Arabic, English) being the most consistently well performing across evaluation iterations.
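To make the idea of a translated functional benchmark template concrete, here is a minimal Python sketch. It is an illustration only and not the paper's code or data: the template text, the names `TEMPLATES`, `instantiate`, `evaluate` and `toy_model`, the value ranges, and the choice of English and Spanish are all assumptions made for this example. The sketch samples fresh values into a per-language template on each evaluation iteration, then reports mean accuracy and its spread across iterations as a rough proxy for robustness.

```python
import random
import re
import statistics

# Hypothetical template, written for this sketch (not taken from CL-GSM Symbolic).
# The same symbolic problem is expressed once per language.
TEMPLATES = {
    "en": "{name} has {x} apples and buys {y} more. How many apples does {name} have now?",
    "es": "{name} tiene {x} manzanas y compra {y} más. ¿Cuántas manzanas tiene {name} ahora?",
}
NAMES = ["Ada", "Omar", "Lucía"]


def instantiate(lang, rng):
    """Sample concrete values into the template; return (question, gold answer)."""
    name = rng.choice(NAMES)
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    question = TEMPLATES[lang].format(name=name, x=x, y=y)
    return question, x + y


def evaluate(model, lang, n_iterations=5, n_items=20, seed=0):
    """Score `model` on several freshly sampled instantiations.

    Returns (mean accuracy, population standard deviation across iterations).
    """
    accuracies = []
    for iteration in range(n_iterations):
        rng = random.Random(seed + iteration)  # a different sample per iteration
        correct = 0
        for _ in range(n_items):
            question, gold = instantiate(lang, rng)
            correct += int(model(question) == gold)
        accuracies.append(correct / n_items)
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


def toy_model(question):
    """Stand-in for an LLM call: adds every integer found in the question."""
    return sum(int(token) for token in re.findall(r"\d+", question))


if __name__ == "__main__":
    for lang in TEMPLATES:
        mean_acc, spread = evaluate(toy_model, lang)
        print(lang, mean_acc, spread)
```

In a real setting `toy_model` would be replaced by a call to the model under test. Comparing the mean accuracy here against a static benchmark score for the same language gives the kind of performance gap the abstract reports, and the spread indicates how stable a language is across evaluation iterations.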

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
