LG · May 25, 2023

Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors

arXiv:2305.15696v1 · 3 citations · Has Code
Originality: Incremental advance
AI Analysis

The test gives data scientists and ML practitioners a tool for auditing datasets for common real-world issues. The advance is incremental because it builds on existing drift detection methods.

The paper tackles the problem of detecting violations of the IID assumption in datasets, such as distributional drift or non-random ordering, by proposing a k-nearest neighbors-based statistical test that works across various data types like numeric, image, text, and audio.

We present a straightforward statistical test to detect certain violations of the assumption that the data are Independent and Identically Distributed (IID). The specific form of violation considered is common across real-world applications: whether the examples are ordered in the dataset such that almost adjacent examples tend to have more similar feature values (e.g. due to distributional drift, or attractive interactions between datapoints). Based on a k-Nearest Neighbors estimate, our approach can be used to audit any multivariate numeric data as well as other data types (image, text, audio, etc.) that can be numerically represented, perhaps with model embeddings. Compared with existing methods to detect drift or auto-correlation, our approach is both applicable to more types of data and also able to detect a wider variety of IID violations in practice. Code: https://github.com/cleanlab/cleanlab
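The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's exact procedure or the cleanlab implementation. It assumes Euclidean distance on numeric features, a brute-force neighbor search, and a simple permutation test on the mean index gap between each example and its k nearest neighbors. The function names (`knn_indices`, `mean_index_gap`, `non_iid_p_value`) and the parameters `k` and `n_permutations` are hypothetical choices made for illustration.

```python
import numpy as np


def knn_indices(X, k):
    """Return the row indices of the k nearest neighbors of each row (brute force)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                    # a point is not its own neighbor
    return np.argpartition(d2, k, axis=1)[:, :k]    # requires k < number of rows


def mean_index_gap(nbrs, positions):
    """Average |position(i) - position(j)| over all (point, neighbor) pairs."""
    return np.abs(positions[nbrs] - positions[:, None]).mean()


def non_iid_p_value(X, k=10, n_permutations=200, seed=0):
    """Permutation p-value: small when feature-space neighbors sit close in dataset order."""
    rng = np.random.default_rng(seed)
    n = len(X)
    nbrs = knn_indices(X, k)
    observed = mean_index_gap(nbrs, np.arange(n))
    # Under the null, dataset order carries no information, so shuffling the
    # positions assigned to the rows should not change the typical index gap.
    null = np.array([
        mean_index_gap(nbrs, rng.permutation(n)) for _ in range(n_permutations)
    ])
    return (1 + np.sum(null <= observed)) / (1 + n_permutations)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 300
    # Synthetic example: the feature mean drifts steadily with the dataset index.
    drifting = rng.normal(size=(n, 3)) + np.linspace(0, 5, n)[:, None]
    print("drifting data p-value:", non_iid_p_value(drifting, k=5))
```

For image, text, or audio data, `X` would hold model embeddings rather than raw features, as the abstract suggests. Permuting the positions while keeping the neighbor graph fixed avoids recomputing neighbors for every permutation.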

Code Implementations: 1 repo
