IR · CL · LG · Jan 31, 2025

mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval

AI2
arXiv:2501.19264v1 · 11 citations · h-index: 38 · ECIR
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of evaluating instruction-following retrieval models across languages. The contribution is incremental, since the benchmark builds on existing TREC narratives.

The authors tackled the lack of multilingual benchmarks for instruction-following in retrieval by introducing mFollowIR, a benchmark across Russian, Chinese, and Persian, and found that English-based retrievers perform well cross-lingually but show a notable drop in multilingual settings.

Retrieval systems generally focus on web-style queries that are short and underspecified. However, advances in language models have facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. These efforts have so far focused exclusively on English, so we do not yet understand how they work across languages. We introduce mFollowIR, a multilingual benchmark for measuring instruction-following ability in retrieval models. mFollowIR builds upon the TREC NeuCLIR narratives (or instructions), which span three diverse languages (Russian, Chinese, Persian), and gives both the query and the instruction to the retrieval models. We make small changes to the narratives and isolate how well retrieval models can follow these nuanced changes. We present results for both the multilingual (XX-XX) and cross-lingual (En-XX) settings. We see strong cross-lingual performance from English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting, indicating that more work is needed in developing data for instruction-based multilingual retrievers.

Code Implementations: 1 repo
