Intra-tree Column Subsampling Hinders XGBoost Learning of Ratio-like Interactions
The paper addresses a practical issue for users of gradient boosted trees in applied problems where ratio-like signals are common, but the findings are incremental because they concern one specific subsampling technique in XGBoost.
The paper investigates how intra-tree column subsampling in XGBoost (colsample_bylevel and colsample_bynode) affects learning of ratio-like interactions. On synthetic data with cancellation-style structure, subsampling reduces test PR-AUC by up to 54% in relative terms when only the primitive features are available, but the effect largely disappears when an engineered ratio feature is included.
Many applied problems contain signal that becomes clear only after combining multiple raw measurements. Ratios and rates are common examples. In gradient boosted trees, this combination is not an explicit operation: the model must synthesize it through coordinated splits on the component features. We study whether intra-tree column subsampling in XGBoost makes that synthesis harder. We use two synthetic data generating processes with cancellation-style structure. In both, two primitive features share a strong nuisance factor, while the target depends on a smaller differential factor. A log ratio cancels the nuisance and isolates the signal. We vary colsample_bylevel and colsample_bynode over s in {0.4, 0.6, 0.8, 0.9}, emphasizing mild subsampling (s >= 0.8). A control feature set includes the engineered ratio, removing the need for synthesis. Across both processes, intra-tree column subsampling reduces test PR-AUC in the primitives-only setting. In the main process the relative decrease reaches 54 percent when both parameters are set to 0.4. The effect largely disappears when the engineered ratio is present. A path-based co-usage metric drops in the same cells where performance deteriorates. Practically, if ratio-like structure is plausible, either avoid intra-tree subsampling or include the intended ratio features.
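
The cancellation-style structure described above can be illustrated with a minimal sketch. This is not the paper's data generating process: the function name make_cancellation_data, the standard deviations, the lognormal form of the primitives, and the 5 percent positive rate are all hypothetical choices made for illustration. Only the parameter names colsample_bylevel and colsample_bynode and the values {0.4, 0.6, 0.8, 0.9} come from the text; crossing them in a full grid is also an assumption.

```python
import itertools
import numpy as np


def make_cancellation_data(n=20000, nuisance_sd=2.0, signal_sd=0.3, seed=0):
    """Hypothetical DGP: two primitives share a strong nuisance factor,
    while the label depends only on a weaker differential factor."""
    rng = np.random.default_rng(seed)
    nuisance = rng.normal(0.0, nuisance_sd, n)  # shared, high variance
    diff = rng.normal(0.0, signal_sd, n)        # differential, low variance
    x1 = np.exp(nuisance + 0.5 * diff)          # primitive feature 1
    x2 = np.exp(nuisance - 0.5 * diff)          # primitive feature 2
    # Rare positive class driven by the differential factor alone (assumed 5%).
    y = (diff > np.quantile(diff, 0.95)).astype(int)
    return x1, x2, y, diff


x1, x2, y, diff = make_cancellation_data()

# The log ratio cancels the shared nuisance and recovers the signal exactly.
log_ratio = np.log(x1) - np.log(x2)

# Each primitive alone is dominated by the nuisance factor.
corr_primitive = abs(np.corrcoef(np.log(x1), y)[0, 1])
corr_ratio = abs(np.corrcoef(log_ratio, y)[0, 1])

# Subsampling grid over the values named in the text (full cross is assumed).
levels = [0.4, 0.6, 0.8, 0.9]
param_grid = [
    {"colsample_bylevel": a, "colsample_bynode": b}
    for a, b in itertools.product(levels, levels)
]
```

In this sketch a tree that sees only x1 or only x2 at a given level or node has almost no usable signal, which is one way to picture why restricting the candidate columns inside a tree could make the synthesis harder. Each dictionary in param_grid could be passed as keyword arguments to an XGBoost classifier, once with the primitives only and once with log_ratio added as a control feature.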