Encoding of data sets and algorithms
This work addresses the need for rigorous quality assurance in high-impact ML applications, though it appears incremental as it builds on existing concepts of metrics and spaces without introducing a new paradigm.
The paper addresses how to ensure the quality and reliability of a machine learning algorithm relative to its complexity. It develops a mathematically rigorous theory for measuring how close models are to one another under metrics such as performance and complexity, constructing grids on the spaces of data sets and algorithms in order to define statistical distances.
In many high-impact applications, it is important to ensure the quality of the output of a machine learning algorithm, as well as its reliability relative to the complexity of the algorithm used. In this paper, we initiate a mathematically rigorous theory for deciding which models (algorithms applied to data sets) are close to each other with respect to certain metrics, such as performance and the complexity level of the algorithm. This involves creating a grid on the hypothetical spaces of data sets and algorithms so as to identify a finite set of probability distributions from which the data sets are sampled and a finite set of algorithms. A given threshold metric acting on this grid then expresses the nearness (or statistical distance) from each algorithm and data set of interest to any given application. A technically difficult part of this project is estimating the so-called metric entropy of a compact subset of functions of infinitely many variables that arise in the definition of these spaces.
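To make the grid and metric-entropy ideas concrete, the following is a minimal toy sketch and not the construction used in the paper. It assumes a finite-dimensional stand-in for the space of data-generating distributions (probability vectors on three outcomes), total variation as the statistical distance, and a greedy procedure for choosing grid points; the names `simplex_grid`, `greedy_epsilon_net`, and `nearest_center` are hypothetical. The logarithm of the number of grid points gives an upper bound on the metric entropy of the candidate set at scale `eps`.

```python
import math
from itertools import product


def total_variation(p, q):
    """Total variation distance between two discrete distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def simplex_grid(steps):
    """Candidate distributions on 3 outcomes (an assumed toy 'space of data sets')."""
    points = []
    for i, j in product(range(steps + 1), repeat=2):
        if i + j <= steps:
            points.append((i / steps, j / steps, (steps - i - j) / steps))
    return points


def greedy_epsilon_net(points, dist, eps):
    """Greedily pick centers so that every point lies within eps of some center.

    A point becomes a new center only if it is farther than eps from all
    centers chosen so far, so the centers are also pairwise more than eps apart.
    """
    centers = []
    for x in points:
        if all(dist(x, c) > eps for c in centers):
            centers.append(x)
    return centers


def nearest_center(x, centers, dist):
    """Grid point closest to x, i.e. the finite representative of x."""
    return min(centers, key=lambda c: dist(x, c))


candidates = simplex_grid(20)
eps = 0.1  # assumed threshold; in the text this role is played by the threshold metric
net = greedy_epsilon_net(candidates, total_variation, eps)

# log of the net size upper-bounds the metric entropy log N(eps) of the candidate set
entropy_estimate = math.log(len(net))

# Map a distribution of interest to its nearest grid representative
query = (0.33, 0.33, 0.34)
representative = nearest_center(query, net, total_variation)
print(len(net), entropy_estimate, representative)
```

In this finite toy setting the covering is easy to compute. The difficulty highlighted above is that the function classes in the actual theory depend on infinitely many variables, where such direct enumeration is not available and the entropy must be estimated analytically.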