StructureBoost: Efficient Gradient Boosting for Structured Categorical Variables
This work addresses a computational bottleneck in gradient boosting for domains with structured categorical data, and reports improvements over existing gradient boosting packages on such problems.
The paper tackles the computational infeasibility of gradient boosting on structured categorical variables with high cardinality. It proposes two efficient methods that outperform CatBoost and LightGBM on problems with sophisticated categorical structure and that enable accurate predictions on unseen categorical values.
Gradient boosting methods based on Structured Categorical Decision Trees (SCDT) have been shown to outperform numerical and one-hot encodings on problems where the categorical variable has a known underlying structure. However, the enumeration procedure in the SCDT is infeasible except for categorical variables with low or moderate cardinality. We propose and implement two methods to overcome the computational obstacles and efficiently perform gradient boosting on complex structured categorical variables. The resulting package, called StructureBoost, is shown to outperform established packages such as CatBoost and LightGBM on problems with categorical predictors that contain sophisticated structure. Moreover, we demonstrate that StructureBoost can make accurate predictions on unseen categorical values due to its knowledge of the underlying structure.
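To make the enumeration bottleneck concrete, the following is a minimal sketch, not the StructureBoost API and not necessarily the paper's procedure. It assumes the structure is given as an adjacency graph over the category values, and that an allowable split is one where both sides are connected in that graph. The helper names (`is_connected`, `structured_splits`, `cycle_graph`) and the 12-value cycle (for example, months of the year) are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def is_connected(nodes, adjacency):
    """Return True if `nodes` induces a connected subgraph of `adjacency`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def structured_splits(adjacency):
    """Brute-force every two-way split whose sides are both connected.

    Assumption: this connectivity rule is only an illustrative notion of
    "respecting the structure". The loop visits about 2**(n-1) candidate
    subsets, which is why naive enumeration stops scaling as n grows.
    """
    values = sorted(adjacency)
    anchor, rest = values[0], values[1:]
    all_values = frozenset(values)
    splits = []
    # Fix `anchor` on the left side so each unordered split is seen once.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = frozenset((anchor,) + extra)
            right = all_values - left
            if is_connected(left, adjacency) and is_connected(right, adjacency):
                splits.append((left, right))
    return splits


def cycle_graph(n):
    """Hypothetical structure: n category values arranged in a ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


months = cycle_graph(12)
n_structured = len(structured_splits(months))
n_unstructured = 2 ** (12 - 1) - 1  # all two-way splits of 12 unordered values
print(n_structured, n_unstructured)
```

For the 12-value ring this prints 66 structured splits against 2047 unconstrained ones, so the structure sharply narrows the set of splits worth evaluating. The brute-force search itself still visits an exponential number of subsets, which illustrates why exhaustive enumeration is only workable at low or moderate cardinality.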