LG IT ML · Jan 9, 2020

What is the Value of Data? On Mathematical Methods for Data Quality Estimation

Netanel Raviv, Siddharth Jain, Jehoshua Bruck

arXiv:2001.03464v2 · 1.26 citations

Originality: Incremental advance

AI Analysis

This work addresses the lack of rigorous methods for data quality estimation, a foundational issue in machine learning and data science. It is incremental in that it builds on existing concepts such as active learning.

The paper proposes a formal definition of data quality based on the expected diameter, which measures the disagreement between hypotheses that explain the dataset. It provides theoretical guarantees and practical solutions for computing this quantity, and validates them experimentally.

Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods of assessing the quality of data are lacking. In this paper, we propose a formal definition for the quality of a given dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and has recently found applications in active learning. We focus on Boolean hyperplanes, and utilize a collection of Fourier analytic, algebraic, and probabilistic methods to come up with theoretical guarantees and practical solutions for the computation of the expected diameter. We also study the behaviour of the expected diameter on algebraically structured datasets, conduct experiments that validate this notion of quality, and demonstrate the feasibility of our techniques.
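To make the notion of expected diameter concrete, the following is a minimal brute-force sketch for tiny dimensions. It rests on several assumptions that are not stated in the abstract: hypotheses are taken to be weight vectors w in {-1, +1}^n classifying x by the sign of the inner product w·x (with n odd, so ties cannot occur), disagreement between two hypotheses is measured under the uniform distribution on {-1, +1}^n, and the two hypotheses are drawn independently and uniformly from those consistent with the dataset. The function names are hypothetical, and the paper's exact definitions and its efficient techniques may differ.

```python
from itertools import product


def sign_dot(w, x):
    """Label of x under hypothesis w: the sign of the inner product.

    With n odd and +/-1 entries the inner product is odd, hence never zero.
    """
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


def expected_diameter(dataset, n):
    """Brute-force expected disagreement between two consistent hypotheses.

    dataset: list of (x, y) pairs, x a tuple in {-1, +1}^n, y in {-1, +1}.
    n: dimension (assumed odd and small, since everything is enumerated).
    """
    cube = list(product((-1, 1), repeat=n))

    # Hypotheses that explain (are consistent with) every labeled point.
    consistent = [
        w for w in cube if all(sign_dot(w, x) == y for x, y in dataset)
    ]
    if not consistent:
        raise ValueError("no hypothesis explains the dataset")

    # Average, over ordered pairs (w1, w2) drawn with replacement, of the
    # fraction of points in the cube on which w1 and w2 disagree.
    total = 0.0
    for w1 in consistent:
        for w2 in consistent:
            disagreements = sum(sign_dot(w1, x) != sign_dot(w2, x) for x in cube)
            total += disagreements / len(cube)
    return total / len(consistent) ** 2


if __name__ == "__main__":
    print(expected_diameter([], 3))
    print(expected_diameter([((1, 1, 1), 1)], 3))
    print(expected_diameter([((1, 1, 1), 1), ((1, 1, -1), 1)], 3))
```

Under these assumptions, adding labeled points shrinks the set of consistent hypotheses, and in this toy example the expected diameter decreases as the dataset grows. The enumeration is exponential in n, so it serves only as an illustration of the definition.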
