Quantifying Language Disparities in Multilingual Large Language Models
This work addresses the challenge of reliably measuring language disparities in multilingual models, which matters for fairness and evaluation in AI. The contribution is incremental, since it builds on existing evaluation methods.
The authors address fragmented and confounded evaluations of multilingual large language models by proposing a framework with three interpretable metrics that quantify performance disparities across models and languages. Their case study of 13 model variants on 11 multilingual datasets indicates that the framework measures disparities more reliably, particularly for low-resource languages, and that higher overall model performance does not necessarily imply greater fairness across languages.
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics (the performance realisation ratio, its coefficient of variation, and language potential), enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
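The abstract names the three metrics but does not define them, so the following is only a rough sketch of how such quantities could be computed, not the authors' formulation. It assumes, purely for illustration, that language potential is the best score any evaluated model reaches on a language, and that the performance realisation ratio is a model's score divided by that potential. The coefficient of variation is the standard statistic (standard deviation divided by mean). All function names and the toy scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def language_potential(scores):
    """ASSUMED definition: best score on each language across all models.

    scores: {model_name: {language: score}}
    """
    languages = {lang for per_lang in scores.values() for lang in per_lang}
    return {
        lang: max(per_lang[lang] for per_lang in scores.values() if lang in per_lang)
        for lang in languages
    }


def realisation_ratios(scores, potential):
    """ASSUMED definition: a model's score divided by the language potential."""
    return {
        model: {
            lang: score / potential[lang]
            for lang, score in per_lang.items()
            if potential[lang] > 0  # skip languages where no model scores above zero
        }
        for model, per_lang in scores.items()
    }


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (standard definition)."""
    values = list(values)
    m = mean(values)
    return pstdev(values) / m if m else float("nan")


# Toy, made-up scores for illustration only (not results from the paper).
scores = {
    "model_a": {"en": 0.95, "sw": 0.50},
    "model_b": {"en": 0.75, "sw": 0.65},
}

potential = language_potential(scores)
ratios = realisation_ratios(scores, potential)
raw_mean = {model: mean(per_lang.values()) for model, per_lang in scores.items()}
cv = {model: coefficient_of_variation(r.values()) for model, r in ratios.items()}

for model in scores:
    print(model, "mean score:", round(raw_mean[model], 3), "CV of ratios:", round(cv[model], 3))
```

In this toy setup, model_a has the higher average score but also the higher coefficient of variation across languages, which mirrors the kind of pattern the abstract describes: stronger overall performance need not mean more even performance across languages.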