MLDSLGSTNov 7, 2024

Statistical-Computational Trade-offs for Recursive Adaptive Partitioning Estimators

arXiv:2411.04394v33 citationsh-index: 16
Originality Incremental advance
AI Analysis

This work addresses a fundamental issue for practitioners using tree-based models in high-dimensional settings, highlighting limitations of greedy algorithms and providing theoretical insights, though it is incremental as it builds on prior work on MSP and neural network comparisons.

The paper tackles the problem of greedy training algorithms for recursive adaptive partitioning estimators, such as decision trees, getting stuck at local optima in high-dimensional regression. It shows that when the true function lacks a specific property (MSP), greedy training requires exponentially many samples (exp(Ω(d))) for low error, but with MSP, it needs only O(log d) samples, while ERM-trained estimators achieve O(log d) samples regardless of MSP, revealing a statistical-computational trade-off.

Models based on recursive adaptive partitioning such as decision trees and their ensembles are popular for high-dimensional regression as they can potentially avoid the curse of dimensionality. Because empirical risk minimization (ERM) is computationally infeasible, these models are typically trained using greedy algorithms. Although effective in many cases, these algorithms have been empirically observed to get stuck at local optima. We explore this phenomenon in the context of learning sparse regression functions over $d$ binary features, showing that when the true regression function $f^*$ does not satisfy Abbe et al. (2022)'s Merged Staircase Property (MSP), greedy training requires $\exp(Ω(d))$ to achieve low estimation error. Conversely, when $f^*$ does satisfy MSP, greedy training can attain small estimation error with only $O(\log d)$ samples. This dichotomy mirrors that of two-layer neural networks trained with stochastic gradient descent (SGD) in the mean-field regime, thereby establishing a head-to-head comparison between SGD-trained neural networks and greedy recursive partitioning estimators. Furthermore, ERM-trained recursive partitioning estimators achieve low estimation error with $O(\log d)$ samples irrespective of whether $f^*$ satisfies MSP, thereby demonstrating a statistical-computational trade-off for greedy training. Our proofs are based on a novel interpretation of greedy recursive partitioning using stochastic process theory and a coupling technique that may be of independent interest.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes