CL · Dec 19, 2022

Statistical Dataset Evaluation: Reliability, Difficulty, and Validity

Peking U
arXiv:2212.09272v1 · 7 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses dataset quality assessment for machine learning practitioners, offering a tool to improve model training and testing, though it is incremental as it builds on classical testing theory.

The paper addresses biased models and unreliable evaluations caused by dataset problems. It proposes a model-agnostic framework that evaluates dataset quality along three dimensions (reliability, difficulty, and validity) and, using Named Entity Recognition datasets as a case study, introduces 9 statistical metrics that are validated by experiments and human evaluation.

Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic framework for automatic dataset quality evaluation. We examine the statistical properties of datasets and address three fundamental dimensions: reliability, difficulty, and validity, following classical testing theory. Taking Named Entity Recognition (NER) datasets as a case study, we introduce 9 statistical metrics for the framework. Experimental results and human evaluation validate that the framework effectively assesses various aspects of dataset quality. Furthermore, we study how dataset scores on our statistical metrics affect model performance, and call for dataset quality evaluation or targeted dataset improvement before training or testing models.
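To make the idea of model-agnostic, purely statistical dataset checks concrete, here is a minimal sketch in Python. The abstract does not define the nine metrics, so the two functions below (`duplicate_ratio` and `unseen_entity_ratio`) are hypothetical illustrations of the kind of statistic that could sit under reliability and difficulty; they are not the paper's definitions. The BIO-tagged toy data format is likewise an assumption.

```python
from typing import List, Optional, Tuple

# Assumed toy format: a sentence is a list of (token, BIO tag) pairs.
Sentence = List[Tuple[str, str]]


def entity_mentions(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (surface text, entity type) pairs from BIO-tagged tokens."""
    mentions: List[Tuple[str, str]] = []
    tokens: List[str] = []
    etype: Optional[str] = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and etype != tag[2:]
        )
        if starts_new:
            # Close any open mention, then start a new one.
            if tokens:
                mentions.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current mention.
            tokens.append(token)
        else:
            # "O" tag: close any open mention.
            if tokens:
                mentions.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        mentions.append((" ".join(tokens), etype))
    return mentions


def duplicate_ratio(dataset: List[Sentence]) -> float:
    """Hypothetical reliability-style check: share of exact repeat sentences."""
    if not dataset:
        return 0.0
    seen = set()
    duplicates = 0
    for sentence in dataset:
        key = tuple(sentence)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(dataset)


def unseen_entity_ratio(train: List[Sentence], test: List[Sentence]) -> float:
    """Hypothetical difficulty-style check: share of test mentions whose
    surface text never appears as an entity in the training split."""
    train_surfaces = {text for s in train for text, _ in entity_mentions(s)}
    test_surfaces = [text for s in test for text, _ in entity_mentions(s)]
    if not test_surfaces:
        return 0.0
    unseen = sum(1 for text in test_surfaces if text not in train_surfaces)
    return unseen / len(test_surfaces)


train = [
    [("Alice", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Alice", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],  # exact repeat
    [("New", "B-LOC"), ("York", "I-LOC"), ("is", "O"), ("big", "O")],
    [("Bob", "B-PER"), ("left", "O")],
]
test = [
    [("Alice", "B-PER"), ("met", "O"), ("Carol", "B-PER")],
    [("New", "B-LOC"), ("York", "I-LOC"), ("and", "O"), ("Berlin", "B-LOC")],
]

print(duplicate_ratio(train))            # 0.25
print(unseen_entity_ratio(train, test))  # 0.5
```

Both statistics are computed from the annotations alone, without training or querying any model, which is the sense in which such checks are model-agnostic.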

