CL | Jul 16, 2025

Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models

arXiv:2507.11882v1 | 1 citation | h-index: 13 | Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for better multilingual evaluation benchmarks for LLMs, though its contribution is incremental: it extends an existing dataset (IFEval) with localization.

The paper tackles the evaluation of multilingual instruction following in Large Language Models by introducing Marco-Bench-MIF, a localized benchmark covering 30 languages. It finds a significant accuracy gap (25-35%) between high- and low-resource languages and shows that machine-translated data underestimates accuracy by 7-22% relative to localized data.

Instruction-following capability has become a major ability to be evaluated for Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine-translated into other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we find that: (1) there is a 25-35% accuracy gap between high- and low-resource languages, (2) model scale strongly affects performance (by 45-60%), yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% compared with localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Marco-Bench-MIF is available at https://github.com/AIDC-AI/Marco-Bench-MIF.
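To illustrate why a capitalization constraint cannot simply be translated for a script without letter case, here is a minimal sketch of IFEval-style rule checks. The function names (`check_all_uppercase`, `check_keyword_count`, `needs_localization`) and the constraint label `"all_uppercase"` are hypothetical and are not taken from the Marco-Bench-MIF repository; the sketch only shows the general idea of a verifiable instruction and of flagging one for localization.

```python
def has_cased_letters(text: str) -> bool:
    """Return True if the text contains any character with letter case."""
    return any(ch.isupper() or ch.islower() for ch in text)


def check_all_uppercase(response: str) -> bool:
    """Verifiable check: all cased characters are uppercase.

    str.isupper() returns False when the text has no cased characters,
    so a purely Chinese response can never pass this check.
    """
    return response.isupper()


def check_keyword_count(response: str, keyword: str, min_count: int) -> bool:
    """Verifiable check: the keyword appears at least min_count times."""
    return response.count(keyword) >= min_count


def needs_localization(constraint: str, reference_text: str) -> bool:
    """Flag a case-based constraint when the target script has no case.

    'all_uppercase' is an assumed label used only for this illustration.
    """
    return constraint == "all_uppercase" and not has_cased_letters(reference_text)


if __name__ == "__main__":
    print(check_all_uppercase("HELLO WORLD"))
    print(check_all_uppercase("你好世界"))
    print(needs_localization("all_uppercase", "你好世界"))
    print(check_keyword_count("我喜欢茶,茶很好", "茶", 2))
```

Under these assumptions, an "answer in all capital letters" instruction that is machine-translated into Chinese yields a check that no response can satisfy, which is one way translated data could understate a model's accuracy.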

Code Implementations: 1 repo