Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm
The work gives researchers a tool for creating benchmark datasets tailored to a requested complexity, though the contribution is incremental: it builds on existing complexity measures and genetic algorithms.
The paper addresses the need for diverse synthetic datasets for evaluating ML methods. It develops a genetic algorithm that transforms datasets to reach target complexity levels for classification and regression tasks, and its experiments show a correlation between the complexity of the generated data and recognition quality.
The research community continues to seek increasingly advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.
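The core idea of evolving a linear feature projection until a complexity measure reaches a requested value can be illustrated with a small sketch. This is not the paper's algorithm: the paper optimizes many measures at once, whereas this toy version uses a single simplified two-class measure in the style of Fisher's discriminant ratio (F1) and a plain elitist genetic algorithm. The function names `fisher_f1` and `evolve_projection`, the synthetic data, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def fisher_f1(X, y):
    """Simplified F1-style measure for two classes, in (0, 1]; higher means harder.

    Assumption: per-feature Fisher ratio (mu0 - mu1)^2 / (var0 + var1),
    mapped to 1 / (1 + max ratio). Not taken from the paper.
    """
    a, b = X[y == 0], X[y == 1]
    ratio = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return 1.0 / (1.0 + ratio.max())


def evolve_projection(X, y, target, n_out=2, pop_size=30, generations=60, sigma=0.3, seed=0):
    """Evolve a projection matrix P so that fisher_f1(X @ P, y) approaches `target`."""
    rng = np.random.default_rng(seed)
    # Each individual is one candidate projection matrix (n_features x n_out).
    pop = rng.normal(size=(pop_size, X.shape[1], n_out))
    n_elite = pop_size // 2
    history = []

    def error(P):
        return abs(fisher_f1(X @ P, y) - target)

    for _ in range(generations):
        scores = np.array([error(P) for P in pop])
        history.append(scores.min())
        # Elitism: the better half survives unchanged.
        elite = pop[np.argsort(scores)[:n_elite]]
        # Offspring: uniform crossover of two random elite parents plus Gaussian mutation.
        n_children = pop_size - n_elite
        pa = elite[rng.integers(n_elite, size=n_children)]
        pb = elite[rng.integers(n_elite, size=n_children)]
        mask = rng.random(pa.shape) < 0.5
        children = np.where(mask, pa, pb) + rng.normal(scale=sigma, size=pa.shape)
        pop = np.concatenate([elite, children])

    scores = np.array([error(P) for P in pop])
    history.append(scores.min())
    return pop[np.argmin(scores)], history


# Synthetic two-class data: only feature 0 separates the classes.
data_rng = np.random.default_rng(42)
y = np.repeat([0, 1], 100)
X = data_rng.normal(size=(200, 6))
X[y == 1, 0] += 2.0

target = 0.6
P_best, history = evolve_projection(X, y, target)
print("complexity before:", fisher_f1(X, y))
print("complexity after: ", fisher_f1(X @ P_best, y), "target:", target)
```

Collapsing everything into one scalar error keeps the sketch short. With several measures and targets, the natural extension would be a vector of per-measure errors handled by a many-objective selection scheme, which is the setting the paper describes.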