LGAISep 20, 2022

Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

CambridgeMILA
arXiv:2209.10015v136 citationsh-index: 24
Originality Incremental advance
AI Analysis

This addresses data quality challenges for ML practitioners, offering an efficient alternative to existing methods that require strong assumptions or a priori knowledge.

The paper tackles the problem of data quality and diversity issues in machine learning datasets by developing Metadata Archaeology, a unified framework that uncovers and infers metadata of examples. The method achieves performance on par with more sophisticated mitigation methods across tasks like identifying mislabeled examples and classifying minority-group samples.

Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in `untidy' or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology -- uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training and enabling scalable human auditing of relevant examples.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes