IR, May 8

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

arXiv:2605.07249 93.7

AI Analysis

For researchers and practitioners working on multilingual IR and RAG systems, this provides a more nuanced evaluation framework that captures language-aware retrieval utility, a dimension that existing evaluations largely overlook.

MLAIRE is a protocol for evaluating multilingual information retrieval that disentangles semantic accuracy from query-language preference, revealing that standard metrics obscure distinct retriever behaviors. Evaluating 31 retrievers, the authors show that semantically strong models may return correct content in a non-query language, while language-preferring models may sacrifice relevance.

Multilingual Information Retrieval is increasingly important in real-world search settings, where users issue queries over mixed-language corpora. Existing evaluations mainly reward language-agnostic semantic relevance, treating relevant passages equally regardless of language. Yet retrieval utility also depends on the language of the retrieved passages: users may prefer results they can read and verify in the query language, and query-passage language mismatch can complicate downstream grounding and answer verification in Retrieval-Augmented Generation systems. To evaluate this language-aware dimension, we introduce MLAIRE, a Multilingual Language-Aware Information Retrieval Evaluation protocol that disentangles cross-lingual semantic retrieval from query-language preference. MLAIRE constructs controlled pools with parallel passages across languages, enabling measurement of semantic retrieval accuracy and query-language preference when equivalent translations are available. We propose language-aware metrics, including Language Preference Rate (LPR) and Lang-nDCG, together with a 4-way decomposition separating semantic and query-language preference failures. Evaluating 31 dense, sparse, and late-interaction retrievers, we show that standard metrics obscure distinct behaviors: semantically strong retrievers may return correct content in a non-query language, while retrievers with stronger query-language preference may retrieve less semantically relevant passages.
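The abstract names a 4-way decomposition and a Language Preference Rate but does not define them here, so the following is only an illustrative sketch under stated assumptions, not the paper's formulation. It assumes each passage in a parallel pool carries a `content_id` shared by all of its translations plus a `lang` code, that only the top-1 result is scored, and that LPR is the share of semantically correct top-1 results that are also in the query language. The names `Passage`, `classify_top1`, `decompose`, and `language_preference_rate` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical category labels for the 4-way split (assumed, not from the paper).
CATEGORIES = ("both_correct", "language_failure", "semantic_failure", "both_failure")


@dataclass(frozen=True)
class Passage:
    content_id: str  # shared by every translation of the same passage
    lang: str        # language code of this particular translation


def classify_top1(query_lang, gold_content_id, top1):
    """Assign one top-1 result to one of four cells (assumed definition)."""
    semantic_ok = top1.content_id == gold_content_id
    language_ok = top1.lang == query_lang
    if semantic_ok and language_ok:
        return "both_correct"
    if semantic_ok:
        return "language_failure"   # right content, non-query language
    if language_ok:
        return "semantic_failure"   # query language, wrong content
    return "both_failure"


def decompose(results):
    """results: iterable of (query_lang, gold_content_id, top1 Passage)."""
    counts = Counter(classify_top1(*r) for r in results)
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in CATEGORIES}
    return {k: counts[k] / total for k in CATEGORIES}


def language_preference_rate(results):
    """Assumed LPR: among semantically correct top-1 hits, the fraction
    returned in the query language. Returns None if there are no such hits."""
    labels = [classify_top1(*r) for r in results]
    correct = [l for l in labels if l in ("both_correct", "language_failure")]
    if not correct:
        return None
    return sum(l == "both_correct" for l in correct) / len(correct)


# Toy data, invented purely for illustration.
results = [
    ("ko", "p1", Passage("p1", "ko")),  # right content, query language
    ("ko", "p2", Passage("p2", "en")),  # right content, other language
    ("en", "p3", Passage("p9", "en")),  # wrong content, query language
    ("en", "p4", Passage("p4", "en")),  # right content, query language
]
breakdown = decompose(results)
lpr = language_preference_rate(results)
```

On this toy input a language-agnostic top-1 accuracy would count three of four queries as correct, while the breakdown shows that one of those three hits arrived in a non-query language, which is the kind of behavior the abstract says standard metrics obscure.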
