Tree-Values: selective inference for regression trees
This work addresses unreliable inference in regression trees, a concern for researchers and practitioners in statistics and machine learning. It is an incremental improvement that adapts selective inference methods to CART.
The paper tackles the problem of conducting valid statistical inference on the output of CART regression trees. Naive inference typically fails to control the Type 1 error rate because the tree is itself estimated from the data, so the paper proposes a selective inference framework that conditions on the tree selection process. The framework yields a test for the difference in means between terminal nodes with controlled selective Type 1 error, and a confidence interval for a node mean with nominal selective coverage. Both are validated in simulations and on a real dataset on portion control and caloric intake.
We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
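To illustrate why the naive approach fails, the following is a minimal sketch and not the paper's method. It assumes a deliberately simplified setting: a single feature with observations already sorted by it, a depth-one tree (one split chosen to minimize the sum of squared errors), pure-noise responses with known variance 1, and a standard two-sample z-test applied to the two resulting nodes. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def best_split(y):
    """Return the split index s (1 <= s < n) minimizing the within-node SSE
    when the left node is y[:s] and the right node is y[s:]."""
    n = len(y)
    total = sum(y)
    best_s, best_gain = 1, -math.inf
    left_sum = 0.0
    for s in range(1, n):
        left_sum += y[s - 1]
        right_sum = total - left_sum
        # Maximizing this quantity is equivalent to minimizing the SSE.
        gain = left_sum ** 2 / s + right_sum ** 2 / (n - s)
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s


def naive_p_value(y, s, sigma=1.0):
    """Two-sided z-test for equal node means that ignores how s was chosen."""
    n = len(y)
    mean_left = sum(y[:s]) / s
    mean_right = sum(y[s:]) / (n - s)
    se = sigma * math.sqrt(1.0 / s + 1.0 / (n - s))
    z = (mean_left - mean_right) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def naive_rejection_rate(n=50, reps=2000, alpha=0.05, seed=0):
    """Fraction of pure-noise datasets on which the naive test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Null model: the response is unrelated to the feature.
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = best_split(y)  # the split is selected using the same data
        if naive_p_value(y, s) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_rejection_rate())
```

Because the split is chosen to make the two node means as different as possible, the naive test rejects far more often than the nominal level of 0.05 even though no true difference exists. A selective test of the kind proposed here would instead calibrate the p-value conditional on the tree having been selected.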