CL · LG · Sep 4, 2025

What if I ask in alia lingua? Measuring Functional Similarity Across Languages

arXiv:2509.04032v2 · h-index: 46 · Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Originality: Synthesis-oriented
AI Analysis

This work addresses how to evaluate and improve the multilingual reliability of AI systems. Its contribution is incremental: it applies an existing metric to new data.

The study measures functional similarity across languages by applying the κ_p metric to 20 languages and 47 subjects in GlobalMMLU. It finds that a model's responses become more consistent across languages as its size and capability increase. It also finds that a model agrees more with itself across languages than it does with other models prompted in the same language.

How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $κ_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $κ_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.
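To make the two comparisons in the abstract concrete, here is a minimal sketch of a chance-adjusted agreement score computed on multiple-choice answers. It uses plain Cohen's kappa on hard answer choices as a stand-in; it is not the paper's $κ_p$, whose exact definition is given in the work that proposed it and may treat output probabilities and accuracy differently. The function name `cohen_kappa` and the toy answer lists are hypothetical and are not taken from the paper or from GlobalMMLU.

```python
from collections import Counter


def cohen_kappa(answers_a, answers_b):
    """Chance-adjusted agreement between two aligned answer lists.

    Stand-in for illustration only: this is Cohen's kappa, not the
    paper's kappa_p metric.
    """
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("answer lists must be non-empty and equally long")
    n = len(answers_a)
    # Fraction of questions on which the two answer lists coincide.
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    # Agreement expected by chance from each list's answer frequencies.
    freq_a = Counter(answers_a)
    freq_b = Counter(answers_b)
    expected = sum(freq_a[opt] * freq_b[opt] for opt in freq_a) / (n * n)
    if expected == 1.0:
        # Both lists always give the same single option.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy answers to the same six questions.
model_x_english = ["A", "B", "C", "D", "A", "B"]
model_x_french = ["A", "B", "C", "D", "A", "C"]
model_y_english = ["B", "B", "A", "D", "C", "C"]

# Same model, different languages (cross-lingual self-consistency).
within_model = cohen_kappa(model_x_english, model_x_french)
# Different models, same language.
across_models = cohen_kappa(model_x_english, model_y_english)

print(f"same model across languages: {within_model:.3f}")
print(f"different models, same language: {across_models:.3f}")
```

In this toy data the same-model, cross-language score is higher than the cross-model, same-language score, which mirrors the direction of the paper's finding but says nothing about its magnitude.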

