CL AP ME Jun 9, 2023

Leveraging text data for causal inference using electronic health records

Reagan Mozer, Aaron R. Kaufman, Leo A. Celi, Luke Miratrix

arXiv:2307.03687v21.73 citations h-index: 64 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of limited structured data in clinical research, particularly in developing countries, by enabling broader use of text data for causal inference, though it is incremental as it builds on existing methods.

The paper addresses the common practice of ignoring unstructured text data in electronic health records when conducting causal inference. It presents a unified framework that combines natural language processing with standard inferential techniques to address missing data, confounding bias, and treatment effect heterogeneity. In an application, incorporating text data strengthens the validity of the estimated treatment effect and identifies patient subgroups that may benefit most from treatment.

In studies that rely on data from electronic health records (EHRs), unstructured text data such as clinical progress notes offer a rich source of information about patient characteristics and care that may be missing from structured data. Despite the prevalence of text in clinical research, these data are often ignored for the purposes of quantitative analysis due to their complexity. This paper presents a unified framework for leveraging text data to support causal inference with electronic health data at multiple stages of analysis. In particular, we consider how natural language processing and statistical text analysis can be combined with standard inferential techniques to address common challenges due to missing data, confounding bias, and treatment effect heterogeneity. Through an application to a recent EHR study investigating the effects of a non-randomized medical intervention on patient outcomes, we show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect and identify patient subgroups that may benefit most from treatment. We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data are limited, such as in developing countries. To this end, we provide code and open-source replication materials to encourage adoption and broader exploration of these techniques in clinical research.
