CL · AI · Feb 7, 2025

M-IFEval: Multilingual Instruction-Following Evaluation

arXiv:2502.04688v1 · 20 citations · h-index: 3 · NAACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of evaluating LLMs in diverse cultural contexts, which is relevant to researchers and developers. It is incremental, however, in that it extends an existing benchmark to new languages.

The authors tackled the limitation of existing instruction-following benchmarks being English-only by creating M-IFEval, a multilingual benchmark for French, Japanese, and Spanish, and found that performance across 8 state-of-the-art LLMs varied widely by language and instruction type.

Instruction following is a core capability of modern large language models (LLMs), making evaluating this capability essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
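
To make "objective criteria" concrete, here is a minimal Python sketch of what programmatically verifiable instruction checks might look like. The function names and the specific checks (`no_commas`, `min_words`, `no_katakana`) are hypothetical illustrations, not taken from the paper or its repository; the katakana check merely stands in for the kind of language-specific instruction the abstract mentions.

```python
import re


def no_commas(response):
    """Hypothetical general check: the response contains no ASCII comma."""
    return "," not in response


def min_words(response, n):
    """Hypothetical general check: at least n whitespace-separated words."""
    return len(response.split()) >= n


def no_katakana(response):
    """Hypothetical Japanese-specific check: no katakana letters.

    Assumes the katakana letter range U+30A1 to U+30FA.
    """
    return re.search(r"[\u30A1-\u30FA]", response) is None


def accuracy(results):
    """Fraction of checks that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0
```

Because each check is a deterministic function of the response text, scoring needs no human or LLM judge. For example, `accuracy([no_commas(r), min_words(r, 50)])` would give the share of instructions a response `r` satisfied under these assumed checks.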
