Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection
The work addresses a fundamental flaw in GBM that matters to practitioners who rely on feature importance, though the contribution is incremental because it builds on known biases of tree-based learners.
The paper tackles the bias in the feature importance measures of Gradient Boosting Machines that is caused by the cardinality of categorical variables, showing that cross-validated unbiased base learners significantly improve these measures while maintaining similar prediction accuracy.
Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations use decision trees, which are typically biased towards categorical variables with large cardinalities. The effect of this bias has been studied extensively over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We show that although these implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By using cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining roughly the same level of prediction accuracy.
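The cardinality bias described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a simplified base learner that gives each category of a feature its own mean prediction (a multi-way split), squared-error loss, and a pure-noise target, so any measured gain is spurious. The function names (`category_means`, `gain`, `cv_gain`, `average_gains`) are hypothetical and chosen only for illustration. The in-sample gain grows with the number of categories, while the gain measured on held-out folds does not reward the extra categories.

```python
import random


def category_means(x, y):
    """Return the mean target value for each category of x."""
    sums, counts = {}, {}
    for xi, yi in zip(x, y):
        sums[xi] = sums.get(xi, 0.0) + yi
        counts[xi] = counts.get(xi, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def gain(x_fit, y_fit, x_eval, y_eval):
    """Squared-error reduction on the eval data from per-category means
    estimated on the fit data (assumed multi-way split, not a real GBM tree)."""
    base = sum(y_fit) / len(y_fit)
    means = category_means(x_fit, y_fit)
    before = sum((yi - base) ** 2 for yi in y_eval)
    # Categories unseen in the fit data fall back to the global mean.
    after = sum((yi - means.get(xi, base)) ** 2 for xi, yi in zip(x_eval, y_eval))
    return before - after


def cv_gain(x, y, k=5):
    """Sum of held-out gains over k folds: means come from k-1 folds,
    the error reduction is measured on the remaining fold."""
    n = len(y)
    total = 0.0
    for fold in range(k):
        fit = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        total += gain([x[i] for i in fit], [y[i] for i in fit],
                      [x[i] for i in held], [y[i] for i in held])
    return total


def average_gains(cardinality, repeats=20, n=500, seed=0):
    """Average in-sample and CV gain of a noise feature with the given
    number of categories against a noise target."""
    rng = random.Random(seed)
    in_sample = held_out = 0.0
    for _ in range(repeats):
        x = [rng.randrange(cardinality) for _ in range(n)]  # uninformative feature
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]         # independent target
        in_sample += gain(x, y, x, y)
        held_out += cv_gain(x, y)
    return in_sample / repeats, held_out / repeats


if __name__ == "__main__":
    for cardinality in (2, 50):
        in_sample, held_out = average_gains(cardinality)
        print(f"K={cardinality:3d}  in-sample gain={in_sample:8.2f}  CV gain={held_out:8.2f}")
```

Under a null target, the in-sample criterion would prefer the 50-level feature over the binary one, which is the kind of bias that propagates into gain-based FI. The held-out criterion has no such preference. How the paper integrates cross-validation into the boosting loop is not specified in this abstract, so the sketch should be read only as an illustration of the principle.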