CL · Apr 5, 2021

What Will it Take to Fix Benchmarking in Natural Language Understanding?

arXiv:2104.02145v3 · 790 citations
AI Analysis

This paper addresses flawed benchmarking in natural language understanding, a problem for both researchers and practitioners. Its contribution is incremental: it critiques existing approaches without presenting new empirical results.

The paper argues that current NLU benchmarks are broken: unreliable and biased systems score so highly that little room remains to demonstrate improvement. It proposes four criteria that better benchmarks should meet in order to restore a healthy evaluation ecosystem.

Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.

