FeatureCuts: Feature Selection for Large Data by Optimizing the Cutoff
This work addresses the problem of scalable feature selection for enterprise applications with large datasets, offering incremental improvements in computational efficiency and feature reduction.
The paper tackles feature selection for large datasets by introducing FeatureCuts, an algorithm that optimizes the feature cutoff after filter ranking. Compared to state-of-the-art methods, it achieves an average of 15 percentage points more feature reduction and up to 99.6% less computation time while maintaining model performance.
In machine learning, feature selection is the process of finding a reduced subset of features that captures most of the information required to train an accurate and efficient model. This work presents FeatureCuts, a novel feature selection algorithm that adaptively selects the optimal feature cutoff after performing filter ranking. Evaluated on 14 publicly available datasets and one industry dataset, FeatureCuts achieved, on average, 15 percentage points more feature reduction and up to 99.6% less computation time than existing state-of-the-art methods, while maintaining model performance. When the features selected by FeatureCuts are passed to a wrapper method such as Particle Swarm Optimization (PSO), the combination achieves 25 percentage points more feature reduction and requires 66% less computation time than PSO alone, again while maintaining model performance. The minimal overhead of FeatureCuts makes it scalable to the large datasets typically seen in enterprise applications.
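To make the idea of "filter ranking followed by a cutoff" concrete, the following is a minimal sketch and not the FeatureCuts algorithm itself, since the abstract does not specify how the cutoff is optimized. It assumes absolute Pearson correlation as the filter score, an exhaustive search over cutoffs, and a hypothetical objective that weighs performance against feature reduction. The names `rank_features`, `choose_cutoff`, `toy_evaluate`, and the `weight` parameter are illustrative, and `toy_evaluate` is a cheap stand-in for training and scoring a real model.

```python
import math
import random


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def rank_features(X, y):
    """Filter ranking: feature indices sorted by descending filter score."""
    n_features = len(X[0])
    scores = [abs_pearson([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def choose_cutoff(ranked, evaluate, weight=0.5):
    """Keep the top-k ranked features, with k chosen by exhaustive search.

    Hypothetical objective (an assumption, not taken from the paper):
        weight * performance + (1 - weight) * fraction_of_features_removed
    where evaluate(features) returns a performance value in [0, 1].
    """
    n = len(ranked)
    best_k, best_score = n, float("-inf")
    for k in range(1, n + 1):
        performance = evaluate(ranked[:k])
        reduction = 1.0 - k / n
        score = weight * performance + (1.0 - weight) * reduction
        if score > best_score:
            best_k, best_score = k, score
    return ranked[:best_k]


# Synthetic data: features 0 and 1 drive the target, features 2..5 are noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(500)]
y = [row[0] + row[1] + 0.1 * rng.gauss(0, 1) for row in X]


def toy_evaluate(features):
    """Stand-in for model performance: correlation of the feature sum with y."""
    combined = [sum(row[j] for j in features) for row in X]
    return abs_pearson(combined, y)


ranked = rank_features(X, y)
selected = choose_cutoff(ranked, toy_evaluate, weight=0.7)
print("ranking:", ranked)
print("selected:", sorted(selected))
```

In this sketch the ranking is computed once and only the cutoff is searched, so the number of candidate subsets grows linearly with the number of features rather than exponentially. An exhaustive scan still calls `evaluate` once per cutoff, so a practical method would need a cheaper search strategy over the cutoff.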