ME · Apr 6, 2021
Variable selection with missing data in both covariates and outcomes: Imputation and machine learning
Liangyuan Hu, Jung-Yi Joyce Lin, Jiayi Ji
The missing data issue is ubiquitous in health studies. Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic but has been less studied. Existing literature focuses on parametric regression techniques that provide direct parameter estimates of the regression model. Flexible nonparametric machine learning methods considerably mitigate the reliance on parametric assumptions, but do not provide a variable importance measure as naturally defined as the covariate effects native to parametric models. We investigate a general variable selection approach for settings in which both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning modeling techniques and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct expansive simulations investigating the practical operating characteristics of the proposed variable selection approach when combined with four tree-based machine learning methods (XGBoost, Random Forests, Bayesian Additive Regression Trees (BART) and Conditional Random Forests) and two commonly used parametric methods (lasso and backward stepwise selection). Numerical results suggest that, when combined with bootstrap imputation, XGBoost and BART have the best overall variable selection performance with respect to the $F_1$ score and Type I error across various settings. In general, there is no significant difference in variable selection performance across imputation methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
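As a rough illustration of the two performance measures named above, the sketch below computes an $F_1$ score and a Type I error for a single selection result. It assumes the usual definitions (precision and recall over the truly useful predictors, and Type I error as the fraction of noise predictors that were selected); the abstract does not spell these out, and the function name `selection_metrics` and the toy predictor names are hypothetical.

```python
def selection_metrics(selected, truly_useful, all_predictors):
    """Score one variable selection result against a known truth.

    Assumed definitions (not taken from the abstract):
      precision = TP / (TP + FP), recall = TP / (TP + FN),
      F1 = harmonic mean of precision and recall,
      Type I error = share of noise predictors that were selected.
    """
    selected = set(selected)
    truly_useful = set(truly_useful)
    noise = set(all_predictors) - truly_useful

    tp = len(selected & truly_useful)   # useful predictors that were selected
    fp = len(selected & noise)          # noise predictors that were selected
    fn = len(truly_useful - selected)   # useful predictors that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    type1 = fp / len(noise) if noise else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "type1": type1}


# Toy example: six candidate predictors, three of which are truly useful.
metrics = selection_metrics(
    selected=["x1", "x2", "x4"],
    truly_useful=["x1", "x2", "x3"],
    all_predictors=["x1", "x2", "x3", "x4", "x5", "x6"],
)
```

In a simulation study these quantities would typically be averaged over many replications; the sketch scores one replication only.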
ME · Aug 18, 2020
Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes
Liangyuan Hu, Chenyang Gu
The preponderance of large-scale healthcare databases provides abundant opportunities for comparative effectiveness research. The evidence needed to make informed treatment decisions often relies on comparing the effectiveness of multiple treatment options on outcomes of interest observed in a small number of individuals. Causal inference with multiple treatments and rare outcomes is a subject that has been treated sparingly in the literature. This paper designs three sets of simulations, representative of the structure of our healthcare database study, and proposes causal analysis strategies for such settings. We investigate and compare the operating characteristics of three types of methods and their variants: Bayesian Additive Regression Trees (BART), regression adjustment on multivariate spline of generalized propensity scores (RAMS) and inverse probability of treatment weighting (IPTW) with multinomial logistic regression or generalized boosted models. Our results suggest that BART and RAMS provide lower bias and mean squared error, and that the widely used IPTW methods deliver unfavorable operating characteristics. We illustrate the methods using a case study evaluating the comparative effectiveness of robotic-assisted surgery, video-assisted thoracoscopic surgery and open thoracotomy for treating non-small cell lung cancer.
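To make the IPTW idea concrete for more than two treatments, the sketch below weights each individual by the inverse of the estimated probability of the treatment actually received and forms a normalized weighted mean outcome per treatment. It is a minimal illustration, not the authors' implementation: the generalized propensity scores are assumed to be supplied already (for example from a multinomial logistic regression), the normalized form of the estimator is one common choice, and the name `iptw_means` and the toy numbers are hypothetical.

```python
def iptw_means(treatments, outcomes, propensities):
    """Normalized IPTW estimate of the mean outcome under each treatment.

    treatments:   treatment index received by each individual (0, 1, 2, ...)
    outcomes:     observed outcome for each individual (e.g. 0/1 event)
    propensities: per individual, a list of estimated probabilities of
                  receiving each treatment (assumed given, rows sum to 1)
    """
    weighted_sum = {}
    weight_total = {}
    for a, y, ps in zip(treatments, outcomes, propensities):
        w = 1.0 / ps[a]  # inverse probability of the treatment received
        weighted_sum[a] = weighted_sum.get(a, 0.0) + w * y
        weight_total[a] = weight_total.get(a, 0.0) + w
    # Dividing by the summed weights normalizes within each treatment group.
    return {a: weighted_sum[a] / weight_total[a] for a in weighted_sum}


# Toy data: four individuals, three treatments, binary outcome.
treatments = [0, 0, 1, 2]
outcomes = [1, 0, 1, 0]
propensities = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
means = iptw_means(treatments, outcomes, propensities)
# A pairwise contrast, e.g. treatment 0 versus treatment 1:
effect_0_vs_1 = means[0] - means[1]
```

With rare outcomes and small propensities, a few individuals can receive very large weights, which is one commonly cited reason weighting estimators can behave poorly in such settings.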
AP · Aug 17, 2020
Estimating heterogeneous survival treatment effect in observational data using machine learning
Liangyuan Hu, Jiayi Ji, Fan Li
Methods for estimating heterogeneous treatment effects in observational data have largely focused on continuous or binary outcomes, and have been less thoroughly vetted with survival outcomes. Using flexible machine learning methods in the counterfactual framework is a promising approach to address challenges due to complex individual characteristics, to which treatments need to be tailored. To evaluate the operating characteristics of recent survival machine learning methods for the estimation of treatment effect heterogeneity and to inform better practice, we carry out a comprehensive simulation study presenting a wide range of settings describing confounded heterogeneous survival treatment effects and varying degrees of covariate overlap. Our results suggest that nonparametric Bayesian Additive Regression Trees within the framework of the accelerated failure time model (AFT-BART-NP) consistently yields the best performance in terms of bias, precision and expected regret. Moreover, the credible interval estimators from AFT-BART-NP provide close to nominal frequentist coverage for the individual survival treatment effect when the covariate overlap is at least moderate. Including a nonparametrically estimated propensity score as an additional fixed covariate in the AFT-BART-NP model formulation can further improve its efficiency and frequentist coverage. Finally, we demonstrate the application of flexible causal machine learning estimators through a comprehensive case study examining the heterogeneous survival effects of two radiotherapy approaches for localized high-risk prostate cancer.