Causal Inference on Outcomes Learned from Text
This work addresses causal inference on text data, a problem relevant to researchers in economics and the social sciences. The contribution is incremental: it builds on existing econometric frameworks and existing LLM applications.
The authors address causal inference on outcomes derived from text in randomized trials. They propose a machine-learning tool that uses large language models to identify systematic differences between groups of documents and that obtains valid inference through sample splitting and human validation. They demonstrate the tool in a proof-of-concept application using abstracts of academic manuscripts.
We propose a machine-learning tool that yields causal inference on text in randomized trials. Based on a simple econometric framework in which text may capture outcomes of interest, our procedure addresses three questions: First, is the text affected by the treatment? Second, which outcomes does the treatment affect? And third, how complete is our description of the causal effects? To answer all three questions, our approach uses large language models (LLMs) to suggest systematic differences across two groups of text documents and then provides valid inference based on costly validation. Specifically, we highlight the need for sample splitting to allow for statistical validation of LLM outputs, as well as the need for human labeling to validate substantive claims about how documents differ across groups. We illustrate the tool in a proof-of-concept application using abstracts of academic manuscripts.
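The role of sample splitting can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simple keyword-frequency heuristic (the hypothetical `propose_keyword`) in place of the LLM step, automatic keyword coding in place of human labeling, a permutation test as the inference step, and synthetic documents. All function names and data are illustrative. The point it shows is that a hypothesis is proposed on one half of the sample and tested only on the held-out half, so the test is not contaminated by the search that produced the hypothesis.

```python
import random


def split_sample(docs, seed=0):
    """Randomly split (text, treated) pairs into discovery and validation halves."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def share_with_word(word, texts):
    """Fraction of texts containing the word (0.0 for an empty list)."""
    if not texts:
        return 0.0
    return sum(word in t.lower().split() for t in texts) / len(texts)


def propose_keyword(discovery):
    """Hypothetical stand-in for the LLM step: choose the word whose
    document frequency differs most between treated and control texts."""
    treated = [t for t, d in discovery if d == 1]
    control = [t for t, d in discovery if d == 0]
    vocab = sorted({w for t, _ in discovery for w in t.lower().split()})
    return max(
        vocab,
        key=lambda w: abs(share_with_word(w, treated) - share_with_word(w, control)),
    )


def permutation_test(values, labels, n_perm=2000, seed=0):
    """Difference in means (treated minus control) with a two-sided
    permutation p-value. Assumes both groups are present in `labels`."""
    rng = random.Random(seed)

    def diff(lab):
        t = [v for v, d in zip(values, lab) if d == 1]
        c = [v for v, d in zip(values, lab) if d == 0]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = diff(labels)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign treatment labels at random
        if abs(diff(shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


def run_demo():
    # Synthetic documents (an assumption for illustration, not data from the paper).
    docs = [("we report a randomized design", 1)] * 40 + [("we report a design", 0)] * 40
    discovery, validation = split_sample(docs, seed=0)
    keyword = propose_keyword(discovery)  # hypothesis from the discovery half only
    # Code the outcome on the held-out half; human labeling would go here.
    values = [int(keyword in text.lower().split()) for text, _ in validation]
    labels = [d for _, d in validation]
    effect, p_value = permutation_test(values, labels)
    return keyword, effect, p_value
```

Calling `run_demo()` returns the proposed keyword, the estimated difference in its frequency between treated and control documents in the validation half, and the permutation p-value. In a real application the keyword coding would be replaced by labels from human annotators, which is the costly validation step the abstract refers to.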