CL CV | Oct 14, 2022

A Survey of Parameters Associated with the Quality of Benchmarks in NLP

Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral

arXiv:2210.07566v10.61 citations | h-index: 46

Originality: Synthesis-oriented

AI Analysis

This work addresses the issue of unreliable benchmarking in NLP for researchers and practitioners, but it is incremental as it focuses on surveying and identifying parameters rather than proposing a complete solution.

The paper tackles the problem that NLP benchmarks are susceptible to spurious biases, which let models overfit without truly learning the task. It surveys and identifies language properties and parameters that capture various aspects of bias, with the aim of helping develop a quality metric for benchmarks.

Several benchmarks have been built with heavy investment in resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task. Despite this finding, benchmarking, while trying to tackle bias, still relies on workarounds. These workarounds do not fully utilize the resources invested in benchmark creation, because low-quality data is discarded, and they cover only limited sets of bias. A potential solution to these issues, a metric quantifying quality, remains underexplored. Inspired by successful quality indices in several domains such as power, food, and water, we take the first step towards a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark. We look for bias-related parameters that can potentially help pave our way towards the metric. We survey existing works and identify parameters capturing various properties of bias, their origins, types, and impact on performance, generalization, and robustness. Our analysis spans datasets and a hierarchy of tasks ranging from NLI to Summarization, ensuring that our parameters are generic and are not overfitted towards a specific task or dataset. We also develop certain parameters in this process.
