Generating Synthetic Text Data to Evaluate Causal Inference Methods
This work addresses the challenge of validating causal inference techniques on unstructured text data, a problem relevant to researchers in natural language processing and causal analysis. The contribution is incremental, as it adapts existing generation models.
The authors address the problem of evaluating causal inference methods on high-dimensional text data by developing a framework that generates synthetic text datasets with known causal effects, and they use it to empirically compare four existing methods for estimating causal effects from text.
Drawing causal conclusions from observational data requires making assumptions about the true data-generating process. Causal inference research typically considers low-dimensional data, such as categorical or numerical fields in structured medical records. High-dimensional and unstructured data such as natural language complicates the evaluation of causal inference methods; such evaluations rely on synthetic datasets with known causal effects. Models for natural language generation have been widely studied and perform well empirically. However, existing methods are not immediately applicable to producing synthetic datasets for causal evaluations, as they do not allow for quantifying a causal effect on the text itself. In this work, we develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects. We use this framework to perform an empirical comparison of four recently proposed methods for estimating causal effects from text data. We release our code and synthetic datasets.
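To make the idea of a synthetic text dataset with a known causal effect concrete, the following is a minimal toy sketch, not the framework described in this work. It assumes a simple unigram generator in place of a real generation model, a single binary confounder, and a causal effect defined as the shift in the expected fraction of "positive" words. All names and constants (`generate_example`, `TRUE_EFFECT`, the word lists) are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical vocabularies for a toy unigram "generation model".
POSITIVE = ["great", "good", "excellent"]
NEUTRAL = ["the", "service", "was", "today", "food"]

TRUE_EFFECT = 0.2       # assumed shift in P(positive word) caused by treatment
CONFOUNDER_SHIFT = 0.3  # assumed shift in P(positive word) caused by confounder


def generate_example(rng, length=20):
    """Sample one (confounder, treatment, text) triple."""
    c = int(rng.random() < 0.5)
    # The confounder makes treatment more likely, so T and text are confounded.
    t = int(rng.random() < (0.8 if c else 0.2))
    p_pos = 0.1 + TRUE_EFFECT * t + CONFOUNDER_SHIFT * c
    words = [
        rng.choice(POSITIVE) if rng.random() < p_pos else rng.choice(NEUTRAL)
        for _ in range(length)
    ]
    return {"confounder": c, "treatment": t, "text": " ".join(words)}


def positive_fraction(text):
    """Text-derived outcome: fraction of tokens that are positive words."""
    words = text.split()
    return sum(w in POSITIVE for w in words) / len(words)


def mean(values):
    return sum(values) / len(values)


def group_difference(rows):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [positive_fraction(r["text"]) for r in rows if r["treatment"] == 1]
    control = [positive_fraction(r["text"]) for r in rows if r["treatment"] == 0]
    return mean(treated) - mean(control)


def naive_estimate(data):
    """Ignores the confounder, so it is biased in this setup."""
    return group_difference(data)


def adjusted_estimate(data):
    """Stratifies on the confounder and averages by stratum size."""
    total = 0.0
    for c in (0, 1):
        stratum = [r for r in data if r["confounder"] == c]
        total += len(stratum) / len(data) * group_difference(stratum)
    return total


rng = random.Random(0)
data = [generate_example(rng) for _ in range(5000)]
print("true effect:", TRUE_EFFECT)
print("naive estimate:", naive_estimate(data))
print("adjusted estimate:", adjusted_estimate(data))
```

Because the effect is fixed by construction, any estimator can be scored against it: here the stratified estimate should land close to `TRUE_EFFECT`, while the naive difference in means is inflated by the confounder. A realistic benchmark would replace the unigram sampler with an actual generation model, which is the harder problem this work targets.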