LG · Feb 11, 2021

EvoSplit: An evolutionary approach to split a multi-label data set into disjoint subsets

arXiv:2102.06154v43.14 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a specific data preprocessing challenge for machine learning practitioners working with multi-label datasets, offering incremental improvements over current splitting techniques.

The paper tackles the problem of splitting multi-label datasets into disjoint subsets for supervised learning, introducing EvoSplit, an evolutionary approach that improves upon existing methods like iterative stratification by better preserving label and label pair distributions across subsets.

This paper presents a new evolutionary approach, EvoSplit, for distributing multi-label data sets into disjoint subsets for supervised machine learning. Currently, data set providers divide a data set either randomly or by iterative stratification, a method that aims to preserve the label (or label pair) distribution of the original data set in the different subsets. With the same aim, this paper first introduces a single-objective evolutionary approach that seeks a split maximizing the similarity to the original data set, considering each distribution (labels or label pairs) independently. Second, a new multi-objective evolutionary algorithm is presented that maximizes the similarity for both distributions (labels and label pairs) simultaneously. Both approaches are validated on well-known multi-label data sets as well as on large image data sets currently used in computer vision and machine learning applications. Compared with iterative stratification, EvoSplit improves the split according to several measures: Label Distribution, Label Pair Distribution, Examples Distribution, and the number of folds and fold-label pairs with zero positive examples.
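To make the single-objective idea concrete, here is a minimal sketch of an evolutionary search for a label-balanced split. It is not the authors' implementation: the function names `label_imbalance` and `evolve_split`, the simplified fitness (mean absolute gap between per-fold and global label frequencies, rather than the paper's Label Distribution measure), and the (1+1) swap-mutation scheme are all assumptions chosen for brevity. Label pairs are not handled here.

```python
import random


def label_imbalance(labels, assignment, n_folds):
    """Mean absolute gap between each fold's label frequency and the global one.

    labels: list of sets of label ids, one set per example.
    assignment: list of fold indices, one per example.
    Lower is better; 0 means every fold mirrors the full data set.
    (Illustrative fitness, not the measure used in the paper.)
    """
    all_labels = sorted(set().union(*labels))
    n = len(labels)
    global_freq = {l: sum(l in s for s in labels) / n for l in all_labels}
    total = 0.0
    for fold in range(n_folds):
        members = [s for s, a in zip(labels, assignment) if a == fold]
        if not members:
            return float("inf")  # an empty fold is never acceptable
        for l in all_labels:
            fold_freq = sum(l in s for s in members) / len(members)
            total += abs(fold_freq - global_freq[l])
    return total / (n_folds * len(all_labels))


def evolve_split(labels, n_folds, generations=2000, seed=0):
    """(1+1) evolutionary search: swap two examples, keep the child if not worse."""
    rng = random.Random(seed)
    # Start from a size-balanced random assignment of examples to folds.
    best = [i % n_folds for i in range(len(labels))]
    rng.shuffle(best)
    best_score = label_imbalance(labels, best, n_folds)
    for _ in range(generations):
        child = best[:]
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]  # a swap keeps fold sizes fixed
        score = label_imbalance(labels, child, n_folds)
        if score <= best_score:
            best, best_score = child, score
    return best, best_score


if __name__ == "__main__":
    # Toy data set: each example is the set of labels it carries.
    toy = [{0}, {0, 1}, {1}, {2}, {0, 2}, {1, 2}, {0}, {1}, {2}, {0, 1, 2}]
    split, score = evolve_split(toy, n_folds=2, generations=500)
    print(split, score)
```

Because the mutation swaps the folds of two examples, fold sizes stay fixed and only the label composition changes. A multi-objective variant in the spirit of the abstract would evaluate a second fitness over co-occurring label pairs and keep non-dominated candidates instead of a single best one.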

