CL, AI | Dec 2, 2021

How not to Lie with a Benchmark: Rearranging NLP Leaderboards

arXiv:2112.01342v1, 15 citations
Originality: Incremental advance
AI Analysis

The paper addresses a methodological flaw in how NLP leaderboards aggregate scores, one that could mislead model comparisons. This matters to researchers and practitioners in the field.

The paper examines the scoring methods of popular NLP benchmarks and rearranges model rankings using geometric and harmonic means, revealing that human-level performance on SuperGLUE has not been achieved and there is room for improvement.

Comparison with human performance is an essential requirement for a benchmark to be a reliable measurement of model capabilities. Nevertheless, the methods used for model comparison may have a fundamental flaw: the arithmetic mean of separate metrics is used across all tasks, regardless of their differing complexity and differing test and training set sizes. In this paper, we examine the overall scoring methods of popular NLP benchmarks and rearrange the models by geometric and harmonic mean (appropriate for averaging rates) according to their reported results. We analyze several popular benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows that, for example, human-level performance on SuperGLUE has still not been reached, and there is still room for improvement for current models.
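
As a rough illustration of why the choice of mean matters, the sketch below ranks two models under the arithmetic, geometric, and harmonic means. The model names and per-task scores are hypothetical and are not taken from the paper or from any real leaderboard. The helper `rank` is likewise an illustrative name, not the authors' code.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Hypothetical per-task scores in (0, 1]; not from any real leaderboard.
# model_a is strong on two tasks and weak on one; model_b is uniform.
scores = {
    "model_a": [0.95, 0.95, 0.50],
    "model_b": [0.79, 0.79, 0.79],
}


def rank(scores, aggregate):
    """Return model names sorted from best to worst under the given mean."""
    return sorted(scores, key=lambda name: aggregate(scores[name]), reverse=True)


rankings = {
    "arithmetic": rank(scores, fmean),
    "geometric": rank(scores, geometric_mean),
    "harmonic": rank(scores, harmonic_mean),
}

for mean_name, order in rankings.items():
    print(mean_name, order)
```

With these made-up numbers, the arithmetic mean places model_a first (0.80 against 0.79), while the geometric and harmonic means both place model_b first, because they penalize the single weak task more heavily.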

