Reading Between the Lines: Deconfounding Causal Estimates using Text Embeddings and Deep Learning
This work addresses causal inference in observational studies and is aimed at researchers and practitioners. It offers a novel method for handling high-dimensional text data, though it builds incrementally on existing Double Machine Learning techniques.
The study tackles selection bias in causal effect estimation from observational data by proposing a Neural Network-Enhanced Double Machine Learning framework that uses text embeddings to capture otherwise unobserved confounders. On a synthetic benchmark, the deep learning approach reduced bias to -0.86%, effectively recovering the ground-truth causal parameter, whereas standard tree-based methods retained a bias of +24%.
Estimating causal treatment effects in observational settings is frequently compromised by selection bias arising from unobserved confounders. While traditional econometric methods struggle when these confounders are orthogonal to structured covariates, high-dimensional unstructured text often contains rich proxies for these latent variables. This study proposes a Neural Network-Enhanced Double Machine Learning (DML) framework designed to leverage text embeddings for causal identification. Using a rigorous synthetic benchmark, we demonstrate that unstructured text embeddings capture critical confounding information that is absent from structured tabular data. However, we show that standard tree-based DML estimators retain substantial bias (+24%) due to their inability to model the continuous topology of embedding manifolds. In contrast, our deep learning approach reduces bias to -0.86% with optimized architectures, effectively recovering the ground-truth causal parameter. These findings suggest that deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data.
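
To make the estimation idea concrete, the following is a minimal sketch of a cross-fitted, partialling-out DML estimator for a partially linear model, in which both the outcome and the treatment are residualized on embedding features before a final residual-on-residual regression. It is not the paper's implementation. The simulated data, the function names (`ridge_fit_predict`, `dml_partial_linear`, `simulate`), the embedding dimension, and the true effect of 2.0 are all assumptions made for illustration. A simple ridge regression stands in for the neural network nuisance learners, which is adequate here only because the simulated confounder is linear in the stand-in embeddings.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Fit ridge regression with an intercept column and predict on X_test.

    Placeholder nuisance learner (an assumption); the paper uses neural networks.
    """
    Xtr = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    Xte = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
    A = Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1])
    coef = np.linalg.solve(A, Xtr.T @ y_train)
    return Xte @ coef


def dml_partial_linear(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(X) + e."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance functions are fit on the other folds only (cross-fitting).
        y_res[test_idx] = y[test_idx] - ridge_fit_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - ridge_fit_predict(X[train], d[train], X[test_idx])
    # Final stage: regress outcome residuals on treatment residuals.
    return float(d_res @ y_res / (d_res @ d_res))


def simulate(n=4000, dim=16, theta=2.0, seed=0):
    """Hypothetical data: a confounder that is recoverable only from embeddings."""
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(n, dim))           # stand-in for text embeddings
    w = rng.normal(size=dim)
    u = E @ w / np.sqrt(dim)                # latent confounder
    d = u + rng.normal(size=n)              # treatment depends on the confounder
    y = theta * d + 2.0 * u + rng.normal(size=n)
    return y, d, E


y, d, E = simulate()
dc, yc = d - d.mean(), y - y.mean()
naive = float(dc @ yc / (dc @ dc))          # ignores the confounder entirely
theta_hat = dml_partial_linear(y, d, E)
print(f"naive estimate: {naive:.3f}")
print(f"DML estimate:   {theta_hat:.3f} (true effect in this simulation: 2.0)")
```

In this toy setting the naive regression absorbs the confounder's effect into the treatment coefficient, while the cross-fitted estimator should land close to the simulated effect. When the confounder is a nonlinear function of the embeddings, the ridge learner would have to be replaced by a more flexible model, which is the situation the abstract describes.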