From tests to effect sizes: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation benchmarks
This work addresses the need for more robust statistical evaluation of NLP benchmarks and is aimed at researchers and practitioners working in multilingual and multitask settings. Its contribution is incremental, since it builds on existing resampling techniques.
The paper tackles the problem of quantifying uncertainty and statistical variability in multilingual and multitask NLP benchmarks by introducing resampling-based methods. It shows that both model- and data-related sources of variation must be accounted for to avoid underestimating overall variability. Using question answering, machine translation, and named entity recognition as example tasks, it also demonstrates how resampling yields sampling distributions for leaderboard quantities such as averages and medians, pairwise differences between models, and rankings.
In this paper, we introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises from both model- and data-related sources, and that accounting for both of them is necessary to avoid substantially underestimating the overall variability over hypothetical replications. Using multilingual question answering, machine translation, and named entity recognition as example tasks, we also demonstrate how resampling methods are useful for computing sampling distributions for various quantities used in leaderboards such as the average/median, pairwise differences between models, and rankings.
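As a rough illustration of why both sources of variation matter, the minimal Python sketch below bootstraps a benchmark mean while resampling training runs (model-related variation), test items (data-related variation), or both. It is not the paper's implementation: the function name `bootstrap_means`, the `scores[run][item]` layout, and the synthetic 0/1 scores are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_means(scores, n_boot=500, resample_runs=True,
                    resample_items=True, seed=0):
    """Bootstrap the mean score over runs and test items.

    scores[r][i] is the score of training run r on test item i
    (hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    n_runs, n_items = len(scores), len(scores[0])
    means = []
    for _ in range(n_boot):
        # Model-related variation: draw training runs with replacement.
        runs = ([rng.randrange(n_runs) for _ in range(n_runs)]
                if resample_runs else range(n_runs))
        # Data-related variation: draw test items with replacement.
        items = ([rng.randrange(n_items) for _ in range(n_items)]
                 if resample_items else range(n_items))
        total = sum(scores[r][i] for r in runs for i in items)
        means.append(total / (n_runs * n_items))
    return means


# Synthetic toy data (an assumption, not from the paper): five runs on
# 200 items, where run r answers the first 120 + 10 * r items correctly.
scores = [[1.0 if i < 120 + 10 * r else 0.0 for i in range(200)]
          for r in range(5)]

data_only = bootstrap_means(scores, resample_runs=False)
both = bootstrap_means(scores)

print("std. dev., items only:    ", statistics.stdev(data_only))
print("std. dev., runs and items:", statistics.stdev(both))
```

On this toy data the spread of the replicates is larger when runs are resampled as well, which is the kind of underestimation the abstract warns about. Under the same assumptions, sampling distributions for pairwise differences or rankings could be obtained by reusing the same resampled item indices for every model and recording the difference or rank order in each replicate.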