CL · AI · May 29

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov

arXiv:2605.31142 · 17.3

Predicted impact: top 61% in CL · last 90 days · Originality: Incremental advance

AI Analysis

This study addresses a problem facing researchers and practitioners: conclusions about which multilingual model is best are often inconsistent. It analyzes how sensitive model rankings are to dataset composition and to the performance aggregation method.

This paper investigates the robustness of multilingual text embedding model rankings across learning tasks, languages, and benchmark datasets. It finds that large-scale LLM-based models are generally robust top performers in task-specific analyses, but that only a small subset of models performs consistently well across tasks, ranking schemes, and data subsamples in task-agnostic evaluations.

Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset composition and performance aggregation method. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changes in dataset composition) and ranking-scheme robustness (sensitivity of rankings to changes in the aggregation method). These indicators enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis of five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.
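The two kinds of sensitivity described in the abstract can be illustrated with a minimal sketch. This is not the paper's method. The scores are invented, the two aggregation schemes (mean score and Borda count) are simple stand-ins for the multi-criteria decision-making schemes the abstract mentions, and Kendall's tau is an assumed agreement measure. All function names are hypothetical.

```python
import random
from itertools import combinations

# Hypothetical per-dataset scores (higher is better); invented for illustration only.
SCORES = {
    "model_a": [0.95, 0.60, 0.70, 0.73],
    "model_b": [0.75, 0.72, 0.71, 0.74],
    "model_c": [0.50, 0.95, 0.40, 0.60],
}

def rank_by_mean_score(scores, datasets):
    """Order models by average score over the chosen datasets, best first."""
    mean = {m: sum(s[d] for d in datasets) / len(datasets) for m, s in scores.items()}
    return sorted(mean, key=lambda m: (-mean[m], m))

def rank_by_borda(scores, datasets):
    """Order models by Borda count: per dataset, one point for each model beaten."""
    points = {m: 0 for m in scores}
    for d in datasets:
        for a, b in combinations(scores, 2):
            if scores[a][d] > scores[b][d]:
                points[a] += 1
            elif scores[b][d] > scores[a][d]:
                points[b] += 1
    return sorted(points, key=lambda m: (-points[m], m))

def kendall_tau(order_x, order_y):
    """Kendall rank correlation between two strict orderings of the same items."""
    pos_x = {m: i for i, m in enumerate(order_x)}
    pos_y = {m: i for i, m in enumerate(order_y)}
    concordant = discordant = 0
    for a, b in combinations(order_x, 2):
        if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def scheme_agreement(scores, datasets):
    """Agreement between the two aggregation schemes on the same datasets."""
    return kendall_tau(rank_by_mean_score(scores, datasets),
                       rank_by_borda(scores, datasets))

def composition_agreement(scores, rank_fn, subset_size, n_draws=200, seed=0):
    """Mean agreement between the full-data ranking and rankings on random dataset subsets."""
    rng = random.Random(seed)
    all_datasets = list(range(len(next(iter(scores.values())))))
    full = rank_fn(scores, all_datasets)
    taus = [
        kendall_tau(full, rank_fn(scores, rng.sample(all_datasets, subset_size)))
        for _ in range(n_draws)
    ]
    return sum(taus) / len(taus)

if __name__ == "__main__":
    everything = [0, 1, 2, 3]
    print("mean-score ranking:", rank_by_mean_score(SCORES, everything))
    print("Borda ranking:     ", rank_by_borda(SCORES, everything))
    print("scheme agreement:  ", scheme_agreement(SCORES, everything))
    print("composition agreement (Borda, 2 of 4 datasets):",
          composition_agreement(SCORES, rank_by_borda, subset_size=2))
```

On this toy data the two schemes disagree about the top model: one large win lifts model_a's mean, while model_b wins more head-to-head comparisons. A values near 1 indicates rankings that stay stable under the change being tested, and lower values indicate conclusions that depend on the evaluation design.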
