CL · CV · LG · SY · Aug 10, 2020

DQI: A Guide to Benchmark Evaluation

arXiv:2008.03964v1 · 8 citations
Originality: Incremental advance
AI Analysis

This work is aimed at ML researchers and practitioners who rely on benchmarks for model evaluation. It seeks to improve benchmark design and model generalization, though it appears incremental because it builds on existing concerns about benchmark biases.

The paper tackles the problem of evaluating benchmark quality in machine learning: a model may perform well on one benchmark yet fail on similar ones because it exploits spurious biases. It proposes DQI, a data quality metric that quantifies the differences between benchmarks, to guide progress towards models that truly learn the underlying task.

A "state of the art" model A surpasses humans in a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, as opposed to existing binary and black-box approaches. We propose a novel approach to solve this underexplored task of quantifying benchmark quality by debuting a data quality metric: DQI.

