FairCauseSyn: Towards Causally Fair LLM-Augmented Synthetic Data Generation
This work addresses the need for equitable health research and healthcare delivery by providing causally fair synthetic data. The contribution is incremental: it applies an existing causal fairness framework to a new domain and adds LLM augmentation.
The paper tackles the problem of generating synthetic health data that maintains causal fairness, a gap in existing methods. The generated data deviates by less than 10% from real data on causal fairness metrics and, when used to train causally fair predictors, reduces bias on the sensitive attribute by 70% compared to real data.
Synthetic data generation uses generative models to create new data modeled on real-world data. In health applications, generating high-quality data while maintaining fairness for sensitive attributes is essential for equitable outcomes. Existing GAN-based and LLM-based methods focus on counterfactual fairness and are primarily applied in finance and legal domains. Causal fairness provides a more comprehensive evaluation framework by preserving causal structure, but current synthetic data generation methods do not address it in health settings. To fill this gap, we develop the first LLM-augmented synthetic data generation method to enhance causal fairness using real-world tabular health data. Our generated data deviates by less than 10% from real data on causal fairness metrics. When used to train causally fair predictors, the synthetic data reduces bias on the sensitive attribute by 70% compared to real data. This work improves access to fair synthetic data, supporting equitable health research and healthcare delivery.
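The "deviates by less than 10%" claim can be made concrete with a small sketch. The abstract does not say which causal fairness metrics are used or how deviation is computed, so the code below rests on assumptions: it uses the total variation measure, P(y=1 | x=1) - P(y=1 | x=0), as a stand-in metric, reads "deviation" as a relative difference, and runs on invented toy data. The function names are hypothetical and not taken from the paper.

```python
from typing import Sequence


def total_variation(outcomes: Sequence[int], sensitive: Sequence[int]) -> float:
    """Total variation: P(y=1 | x=1) - P(y=1 | x=0) for a binary outcome y
    and a binary sensitive attribute x. Stand-in metric (assumption)."""
    group1 = [y for y, x in zip(outcomes, sensitive) if x == 1]
    group0 = [y for y, x in zip(outcomes, sensitive) if x == 0]
    if not group1 or not group0:
        raise ValueError("both sensitive-attribute groups must be non-empty")
    return sum(group1) / len(group1) - sum(group0) / len(group0)


def relative_deviation(real_value: float, synthetic_value: float) -> float:
    """Relative gap between the metric on synthetic and on real data
    (one plausible reading of 'deviates by less than 10%')."""
    if real_value == 0:
        raise ValueError("relative deviation is undefined when the real value is 0")
    return abs(synthetic_value - real_value) / abs(real_value)


def within_tolerance(real_value: float, synthetic_value: float,
                     tolerance: float = 0.10) -> bool:
    """True if the synthetic metric stays within the given relative tolerance."""
    return relative_deviation(real_value, synthetic_value) < tolerance


# Toy data, invented for illustration only (not from the paper).
real_y = [1, 1, 1, 0, 1, 0, 0, 0]
real_x = [1, 1, 1, 1, 0, 0, 0, 0]
syn_y = [1] * 18 + [0] * 7 + [1] * 6 + [0] * 19
syn_x = [1] * 25 + [0] * 25

tv_real = total_variation(real_y, real_x)
tv_syn = total_variation(syn_y, syn_x)
print(f"TV real={tv_real:.3f}, TV synthetic={tv_syn:.3f}, "
      f"relative deviation={relative_deviation(tv_real, tv_syn):.1%}")
```

A fuller causal fairness analysis would typically decompose this disparity into direct, indirect, and spurious components using a causal graph. The same relative-deviation check could then be applied to each component separately.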