Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation
This work addresses a critical bottleneck for researchers and practitioners in causal inference by providing a systematic comparison of model selection methods, but its contribution is incremental: it builds on existing metrics rather than introducing a fundamentally new paradigm.
The paper tackles the problem of model selection for conditional average treatment effect (CATE) estimation, where standard cross-validation is not directly applicable because counterfactual outcomes are never observed. It conducts an extensive empirical analysis to benchmark existing and novel surrogate metrics and finds that strategies based on careful hyperparameter selection of CATE estimators and on causal ensembling improve model selection, although the abstract reports no specific numerical gains.
We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike in standard supervised machine learning, there is no perfect analogue of cross-validation for model selection, because we do not observe the counterfactual potential outcomes. To address this, a variety of surrogate metrics that use only observed data have been proposed for CATE model selection. However, their effectiveness is not well understood, owing to limited comparisons in prior studies. We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and we provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling.
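The abstract does not spell out any individual metric. As one concrete illustration of a surrogate metric that needs only observed data, the sketch below implements a doubly robust (DR, or AIPW) pseudo-outcome score, one family of metrics from the CATE model-selection literature; it is not claimed to be this paper's implementation. The synthetic data-generating process, the linear nuisance models, the known propensity of 0.5 (a randomized design), and the helper names `dr_score`, `select_model`, and `fit_linear` are all hypothetical. The "oracle" candidate uses the true effect, which is never available in practice, and is included only to show that the score ranks it first.

```python
import numpy as np

def dr_pseudo_outcome(y, t, mu0, mu1, e):
    # AIPW / doubly robust pseudo-outcome: its conditional mean equals tau(x)
    # if either the outcome models (mu0, mu1) or the propensity e are correct.
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)

def dr_score(tau_hat, y, t, mu0, mu1, e):
    # Surrogate for the unobservable PEHE: MSE against the pseudo-outcome.
    # Uses only observed (y, t) plus nuisance estimates; lower is better.
    gamma = dr_pseudo_outcome(y, t, mu0, mu1, e)
    return float(np.mean((gamma - tau_hat) ** 2))

def select_model(candidates, y, t, mu0, mu1, e):
    # Score every candidate's CATE predictions and return the lowest-scoring name.
    scores = {name: dr_score(tau, y, t, mu0, mu1, e) for name, tau in candidates.items()}
    return min(scores, key=scores.get), scores

def fit_linear(x, y):
    # Ordinary least squares with an intercept; returns a prediction function.
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: np.column_stack([np.ones(len(x_new)), x_new]) @ coef

# --- Hypothetical synthetic data: randomized treatment, linear effect ---
rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=(n, 2))
tau_true = 1.0 + 2.0 * x[:, 0]
t = rng.binomial(1, 0.5, size=n)  # assumed known propensity of 0.5
y = x[:, 0] + x[:, 1] + t * tau_true + rng.normal(size=n)

train, val = np.arange(n // 2), np.arange(n // 2, n)

# Nuisance outcome models are fit on the training split only.
tr1 = train[t[train] == 1]
tr0 = train[t[train] == 0]
mu1_model = fit_linear(x[tr1], y[tr1])
mu0_model = fit_linear(x[tr0], y[tr0])
mu0_val, mu1_val = mu0_model(x[val]), mu1_model(x[val])
e_val = np.full(len(val), 0.5)

# Candidate CATE predictions on the validation split.
ate_hat = y[tr1].mean() - y[tr0].mean()
candidates = {
    "oracle": tau_true[val],                   # illustration only
    "constant_ate": np.full(len(val), ate_hat),
    "zero": np.zeros(len(val)),
}
best, scores = select_model(candidates, y[val], t[val], mu0_val, mu1_val, e_val)
print(best, scores)
```

The design choice behind this score is that, when the nuisance models are correct and the candidate was not fit on the validation split, the expected score equals the candidate's true squared error plus a constant that does not depend on the candidate. Ranking candidates by the DR score therefore matches ranking by the unobservable PEHE in expectation, even though no counterfactual outcome is ever used.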