ML LG May 26, 2023

Detecting Errors in a Numerical Response via any Regression Model

arXiv:2305.16583v3
Originality: Incremental advance
AI Analysis

This work addresses error detection in noisy numerical data, such as sensor readings or human estimates. It is an incremental advance: it builds on existing regression models and adds a new filtering procedure.

The paper tackles the problem of detecting errors in numerical datasets by introducing veracity scores that distinguish genuine errors from natural fluctuations. It reports that the method achieves better precision and recall than other approaches on a new benchmark of 5 regression datasets with real-world numerical errors.

Noise plagues many numerical datasets, where the recorded values may fail to match the true underlying values for reasons including erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduce veracity scores that distinguish between genuine errors and natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors, and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
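The general idea described in the abstract, scoring each observed response by how poorly it agrees with a regression model's prediction given the covariates, can be illustrated with a much simpler stand-in. The sketch below is not the paper's veracity score: it assumes a one-dimensional linear model, uses out-of-fold residuals, and divides them by a single global robust noise scale (the median absolute deviation) rather than the uncertainty estimates the paper refers to. All function names, the synthetic data, and the corrupted indices are hypothetical and chosen only for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def out_of_fold_residuals(xs, ys, k=5):
    """Residual of each point under a model fit without that point's fold."""
    n = len(xs)
    residuals = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, n, k):
            residuals[i] = ys[i] - (a + b * xs[i])
    return residuals


def error_scores(xs, ys, k=5):
    """Centered absolute out-of-fold residual divided by a robust scale."""
    res = out_of_fold_residuals(xs, ys, k)
    med = statistics.median(res)
    mad = statistics.median([abs(r - med) for r in res])
    scale = 1.4826 * mad  # MAD rescaled to match a Gaussian standard deviation
    return [abs(r - med) / scale for r in res]


# Synthetic data (assumption): a noisy line with three corrupted responses.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
corrupted = {11, 52, 93}  # hypothetical positions of the injected errors
for i in corrupted:
    ys[i] += 25.0

scores = error_scores(xs, ys)
# Rank points from most to least suspicious and keep the top three.
suspects = sorted(range(len(xs)), key=lambda i: scores[i], reverse=True)[:3]
print(sorted(suspects))
```

Higher scores mark values that are harder to explain as ordinary fluctuation around the fitted trend. A filtering step would then drop or review the highest-scoring points. How many to remove, and how to account for model and noise uncertainty per data point, is where the paper's method goes beyond this sketch.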

Code Implementations: 2 repos