ML · LG · AP · CO · Jul 8, 2020

StructureBoost: Efficient Gradient Boosting for Structured Categorical Variables

arXiv:2007.04446v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses a computational bottleneck in gradient boosting on structured categorical variables, offering an incremental improvement over existing gradient boosting packages.

The paper tackles the computational infeasibility of gradient boosting on structured categorical variables with high cardinality. It proposes two efficient methods that outperform CatBoost and LightGBM on problems whose categorical predictors have sophisticated structure, and that enable accurate predictions on unseen categorical values.

Gradient boosting methods based on Structured Categorical Decision Trees (SCDT) have been demonstrated to outperform numerical and one-hot encodings on problems where the categorical variable has a known underlying structure. However, the enumeration procedure in the SCDT is infeasible except for categorical variables with low or moderate cardinality. We propose and implement two methods to overcome the computational obstacles and efficiently perform Gradient Boosting on complex structured categorical variables. The resulting package, called StructureBoost, is shown to outperform established packages such as CatBoost and LightGBM on problems with categorical predictors that contain sophisticated structure. Moreover, we demonstrate that StructureBoost can make accurate predictions on unseen categorical values due to its knowledge of the underlying structure.
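To illustrate why exhaustive enumeration becomes a bottleneck, here is a minimal brute-force sketch. It is not the StructureBoost implementation or its API. It assumes the structure of a categorical variable is given as an undirected graph over its values, and that a candidate split is allowed only when both sides form connected subgraphs. The helper names (`is_connected`, `connected_splits`, `cycle_graph`, `complete_graph`) are hypothetical and introduced only for this example.

```python
from itertools import combinations


def is_connected(nodes, adjacency):
    """Return True if `nodes` induce a connected subgraph of `adjacency`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor in nodes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def connected_splits(adjacency):
    """Brute-force every binary split whose two sides are both connected.

    Assumption for this sketch: a split is allowed only if both sides
    are connected in the structure graph. The first value is pinned to
    the left side so each unordered split is counted exactly once.
    """
    all_nodes = sorted(adjacency)
    anchor, rest = all_nodes[0], all_nodes[1:]
    splits = []
    # The left side holds the anchor plus `size` other values;
    # the right side must stay non-empty, so size < len(rest).
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {anchor, *extra}
            right = set(all_nodes) - left
            if is_connected(left, adjacency) and is_connected(right, adjacency):
                splits.append((frozenset(left), frozenset(right)))
    return splits


def cycle_graph(n):
    """Values arranged in a ring, e.g. the 12 months of the year."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def complete_graph(n):
    """Every value adjacent to every other value (no useful structure)."""
    return {i: set(range(n)) - {i} for i in range(n)}


if __name__ == "__main__":
    print(len(connected_splits(cycle_graph(12))))     # 66
    print(len(connected_splits(complete_graph(12))))  # 2047
```

A ring of 12 values admits only 66 such splits, while a fully connected graph on 12 values admits 2047, and the brute-force search itself examines on the order of 2^(n-1) subsets regardless of the graph. That growth is one way to see why enumeration is practical only at low or moderate cardinality.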

Code Implementations: 1 repo
