ML · Apr 27, 2021

Sample selection from a given dataset to validate machine learning models

arXiv:2104.14401v1 · 1.9

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for efficient validation in industrial machine learning applications, though it appears incremental, since it adapts existing statistical concepts to this specific context.

The paper tackles the problem of selecting a validation subset from a dataset for evaluating machine learning models in industrial settings. It proposes a method based on design of experiments and support points, using Maximum Mean Discrepancy criteria, and demonstrates its practical relevance through a test case at EDF.

The selection of a validation basis from a full dataset is often required in the industrial use of supervised machine learning algorithms. This validation basis serves to provide an independent evaluation of the machine learning model. To select this basis, we propose to adopt a "design of experiments" point of view, using statistical criteria. We show that the "support points" concept, based on Maximum Mean Discrepancy criteria, is particularly relevant. An industrial test case from the company EDF illustrates the practical interest of the methodology.
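To make the idea of selecting a validation basis with a Maximum Mean Discrepancy (MMD) criterion more concrete, here is a minimal sketch. It is not the paper's algorithm, which the abstract does not spell out. It assumes a Gaussian kernel with a fixed bandwidth and a simple greedy search, and the function names (`gaussian_kernel`, `mmd_squared`, `select_validation_indices`) are hypothetical. At each step it adds the point that gives the smallest squared MMD between the selected subset and the full dataset.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(subset, data, bandwidth=1.0):
    """Squared MMD (biased estimate) between two empirical samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))
    return (mean_kernel(subset, subset)
            - 2.0 * mean_kernel(subset, data)
            + mean_kernel(data, data))


def select_validation_indices(data, n_select, bandwidth=1.0):
    """Greedily pick n_select indices that keep the squared MMD to `data` small.

    Assumption: greedy search with a Gaussian kernel, chosen for illustration.
    """
    n = len(data)
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and len(data)")
    kernel = [[gaussian_kernel(data[i], data[j], bandwidth) for j in range(n)]
              for i in range(n)]
    # potential[i] = average kernel value between point i and the full dataset.
    potential = [sum(row) / n for row in kernel]

    selected = []
    cross = [0.0] * n      # cross[i] = sum of kernel[i][s] over selected s
    within = 0.0           # sum of kernel[s][t] over all selected pairs (s, t)
    potential_sum = 0.0    # sum of potential[s] over selected s

    for _ in range(n_select):
        m = len(selected) + 1
        best_index, best_value = None, math.inf
        for i in range(n):
            if i in selected:
                continue
            # Squared MMD of (selected + [i]) versus the data, up to the
            # constant data-versus-data term, which does not depend on i.
            new_within = within + 2.0 * cross[i] + kernel[i][i]
            value = new_within / m ** 2 - 2.0 * (potential_sum + potential[i]) / m
            if value < best_value:
                best_index, best_value = i, value
        selected.append(best_index)
        within += 2.0 * cross[best_index] + kernel[best_index][best_index]
        potential_sum += potential[best_index]
        for i in range(n):
            cross[i] += kernel[i][best_index]
    return selected


# Usage on synthetic 2-D data (illustrative only, not the EDF test case).
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(60)]
validation_idx = select_validation_indices(points, n_select=10)
validation_set = [points[i] for i in validation_idx]
print(validation_idx, mmd_squared(validation_set, points))
```

The remaining points would form the training set. In practice the kernel, its bandwidth, and any input scaling matter a great deal, and the greedy search used here is only one way to approximately minimize the criterion.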
