ControlBurn: Feature Selection by Sparse Forests
This work addresses feature selection for interpretable machine learning, particularly in domains with correlated data. The contribution is incremental, since it builds on established LASSO and tree ensemble methods.
The paper tackles feature selection in tree ensembles, where importance is split across groups of correlated features. This split suppresses the average ranking of each group and reduces interpretability. It introduces ControlBurn, a weighted LASSO-based algorithm that prunes unnecessary features in a single training iteration. On datasets with correlated features, the authors report substantially better performance than feature selection methods of comparable computational cost.
Tree ensembles distribute feature importance evenly amongst groups of correlated features. The average feature ranking of the correlated group is suppressed, which reduces interpretability and complicates feature selection. In this paper we present ControlBurn, a feature selection algorithm that uses a weighted LASSO-based feature selection method to prune unnecessary features from tree ensembles, just as low-intensity fire reduces overgrown vegetation. Like the linear LASSO, ControlBurn assigns all the feature importance of a correlated group of features to a single feature. Moreover, the algorithm is efficient and only requires a single training iteration to run, unlike iterative wrapper-based feature selection methods. We show that ControlBurn performs substantially better than feature selection methods with comparable computational costs on datasets with correlated features.
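The pruning idea described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the authors' implementation. The abstract does not give the objective, so the sketch assumes a non-negative LASSO over tree predictions in which each tree's penalty is scaled by the number of features it uses. The ensemble (one shallow tree per feature subset), the coordinate-descent solver, the synthetic data, and all function names (`build_tree`, `weighted_nonneg_lasso`, `control_burn_sketch`) are hypothetical choices made for illustration.

```python
import random
from itertools import combinations

def sse(y, rows):
    """Sum of squared errors around the mean response of `rows`."""
    m = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - m) ** 2 for i in rows)

def build_tree(X, y, rows, feats, depth):
    """Greedy regression tree restricted to the features in `feats`."""
    mean = sum(y[i] for i in rows) / len(rows)
    if depth == 0 or len(rows) < 4:
        return mean  # leaf: predict the mean response
    best = None
    for f in feats:
        # Dropping the largest value keeps the right child non-empty.
        for t in sorted({X[i][f] for i in rows})[:-1]:
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            score = sse(y, left) + sse(y, right)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return mean
    _, f, t, left, right = best
    return (f, t, build_tree(X, y, left, feats, depth - 1),
            build_tree(X, y, right, feats, depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def used_features(tree):
    if not isinstance(tree, tuple):
        return set()
    f, _, left, right = tree
    return {f} | used_features(left) | used_features(right)

def weighted_nonneg_lasso(A, y, u, alpha, sweeps=200):
    """Coordinate descent for (1/n)*||y - sum_j w_j A_j||^2 + alpha*sum_j u_j w_j, w >= 0.
    A is a list of columns, one column of predictions per tree (assumed objective)."""
    n = len(y)
    w = [0.0] * len(A)
    resid = list(y)  # residual y - A w, kept up to date
    for _ in range(sweeps):
        for j, col in enumerate(A):
            norm = sum(a * a for a in col)
            if norm == 0.0:
                continue
            # Correlation of tree j with the residual that excludes tree j.
            rho = sum(a * (r + a * w[j]) for a, r in zip(col, resid))
            new = max(0.0, (rho - 0.5 * n * alpha * u[j]) / norm)
            delta = new - w[j]
            if delta != 0.0:
                resid = [r - a * delta for a, r in zip(col, resid)]
                w[j] = new
    return w

def control_burn_sketch(X, y, alpha, depth=2):
    p = len(X[0])
    rows = list(range(len(X)))
    # Toy ensemble: one shallow tree per non-empty feature subset.
    subsets = [s for k in range(1, p + 1) for s in combinations(range(p), k)]
    trees = [build_tree(X, y, rows, s, depth) for s in subsets]
    A = [[predict(t, x) for x in X] for t in trees]
    u = [len(used_features(t)) for t in trees]  # per-tree penalty weight
    w = weighted_nonneg_lasso(A, y, u, alpha)
    selected = set()
    for t, wj in zip(trees, w):
        if wj > 0:
            selected |= used_features(t)  # keep features of surviving trees
    return selected, w

# Synthetic data: x0 and x1 are strongly correlated, x2 is pure noise.
rng = random.Random(0)
X, y = [], []
for _ in range(100):
    z = rng.gauss(0, 1)
    X.append([z, z + rng.gauss(0, 0.05), rng.gauss(0, 1)])
    y.append(2.0 * z + rng.gauss(0, 0.1))

selected, weights = control_burn_sketch(X, y, alpha=0.1)
print(sorted(selected), [round(v, 3) for v in weights])
```

Trees that receive zero weight are dropped, and the selected features are those used by the surviving trees. Because the penalty grows with the number of features a tree uses, the solver tends to prefer trees that explain the response with fewer features; raising `alpha` prunes more aggressively, and a sufficiently large value removes every tree.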