Machine learning the first stage in 2SLS: Practical guidance from bias decomposition and simulation
This paper offers practical guidance for econometricians and data scientists who use ML in causal inference. Its contribution is incremental, building on existing 2SLS and ML frameworks.
The paper asks when machine learning (ML) helps or hurts in the first stage of two-stage least squares (2SLS), decomposing the bias into three components and running simulations. It finds that linear ML methods such as post-Lasso work well, but nonlinear methods such as random forests and neural nets can cause substantial bias in second-stage estimates, potentially exceeding that of endogenous OLS.
Machine learning (ML) primarily evolved to solve "prediction problems." The first stage of two-stage least squares (2SLS) is a prediction problem, suggesting potential gains from ML first-stage assistance. However, little guidance exists on when ML helps 2SLS, or when it hurts. We investigate the implications of inserting ML into 2SLS, decomposing the bias into three informative components. Mechanically, ML-in-2SLS procedures face issues common to prediction and causal-inference settings, and their interaction. Through simulation, we show linear ML methods (e.g., post-Lasso) work well, while nonlinear methods (e.g., random forests, neural nets) generate substantial bias in second-stage estimates, potentially exceeding the bias of endogenous OLS.
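To make the comparison concrete, the following is a minimal simulation sketch, not the paper's design. It assumes a hypothetical data-generating process with one instrument `z`, one unobserved confounder `u`, and a true effect `BETA = 1`. It compares endogenous OLS, 2SLS with a linear first stage, and a plug-in second stage using an in-sample flexible first stage. The flexible learner (`window_first_stage`, a running mean over rank-neighbours in `z`) is a NumPy-only stand-in for nonlinear ML methods, chosen so the example has no dependencies beyond NumPy. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

BETA = 1.0  # true causal effect (assumed for illustration)


def simulate(n, rng):
    """Hypothetical DGP: z is a valid instrument, u confounds x and y."""
    z = rng.normal(size=n)                 # instrument
    u = rng.normal(size=n)                 # unobserved confounder
    x = z + u + rng.normal(size=n)         # endogenous regressor
    y = BETA * x + u + rng.normal(size=n)  # outcome
    return z, x, y


def slope(a, b):
    """OLS slope from regressing b on a (with an intercept)."""
    a_c = a - a.mean()
    return float(a_c @ (b - b.mean()) / (a_c @ a_c))


def linear_first_stage(z, x):
    """Fitted values from a linear regression of x on z."""
    return x.mean() + slope(z, x) * (z - z.mean())


def window_first_stage(z, x, k=5):
    """In-sample flexible fit: mean of x over k rank-neighbours in z.

    The window includes the observation itself, so each fitted value
    absorbs part of that observation's own endogenous noise.
    """
    order = np.argsort(z)
    xs = x[order]
    n = len(xs)
    half = k // 2
    lo = np.clip(np.arange(n) - half, 0, n)
    hi = np.clip(np.arange(n) + half + 1, 0, n)
    csum = np.concatenate(([0.0], np.cumsum(xs)))
    fitted_sorted = (csum[hi] - csum[lo]) / (hi - lo)
    fitted = np.empty(n)
    fitted[order] = fitted_sorted  # undo the sort
    return fitted


def mean_bias(n=2000, reps=50, seed=0):
    """Average (estimate - BETA) over repeated simulated samples."""
    rng = np.random.default_rng(seed)
    est = {"ols": [], "2sls_linear": [], "2sls_flexible": []}
    for _ in range(reps):
        z, x, y = simulate(n, rng)
        est["ols"].append(slope(x, y))
        est["2sls_linear"].append(slope(linear_first_stage(z, x), y))
        est["2sls_flexible"].append(slope(window_first_stage(z, x), y))
    return {name: float(np.mean(vals)) - BETA for name, vals in est.items()}


bias = mean_bias()
for name, value in bias.items():
    print(f"{name:>14s}: bias = {value:+.3f}")
```

Under these assumptions the linear first stage should be close to unbiased, while the flexible first stage lands between linear 2SLS and OLS because its fitted values partly track the endogenous variation in `x`. Setting `k=1` makes the fitted values equal `x` and reproduces OLS. The sketch illustrates only this overfitting channel; it does not reproduce the paper's three-part decomposition or the cases where the bias exceeds that of OLS.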