Differential Subgroup Discovery: Characterizing Where Two Populations Differ, and Why
For practitioners in clinical analysis, model diagnostics, and treatment-effect studies, this work provides a principled method for pinpointing the covariate combinations that drive population-level gaps, together with conditions under which the discovered subgroups admit a causal interpretation.
The paper introduces a formal framework for differential subgroup discovery, that is, identifying subsets in which two populations differ most in a target outcome, and proposes DiffSub, a gradient-based method for finding interpretable subgroups. Across synthetic and real-world benchmarks (medical, model-error, treatment-effect), DiffSub identifies subgroups that reveal where and why population differences occur.
We study the problem of understanding where two populations differ within a feature space, which we formalize in the concept of a differential subgroup: a subset of individuals from both populations who, despite sharing similar characteristics, exhibit exceptional differences in a target outcome. Differential subgroups reveal the regions of the feature space where population-level gaps are most pronounced and can help practitioners identify the covariate combinations that are structurally responsible for these differences, e.g., in clinical analysis, model diagnostics, or treatment-effect studies. We introduce a general optimization objective for discovering differential subgroups and establish conditions under which the resulting subgroups admit a causal interpretation of population differences. We propose DiffSub, a gradient-based approach that discovers interpretable differential subgroups in tabular data. Across synthetic benchmarks, medical case studies, model-error analyses, and treatment-effect settings, DiffSub identifies informative subgroups that reveal where population differences arise and why.
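To make the notion of a differential subgroup concrete, the following minimal sketch compares the outcome gap between two populations overall and within one candidate subgroup. It is only an illustration of the quantity being described, not DiffSub or the paper's optimization objective. The field names (`pop`, `age`, `y`), the toy data, and the helper `outcome_gap` are all hypothetical.

```python
from statistics import mean

def outcome_gap(rows, in_subgroup):
    """Mean outcome of population A minus that of population B,
    restricted to rows selected by the predicate `in_subgroup`.
    Returns None if either population has no rows in the subgroup."""
    a = [r["y"] for r in rows if r["pop"] == "A" and in_subgroup(r)]
    b = [r["y"] for r in rows if r["pop"] == "B" and in_subgroup(r)]
    if not a or not b:
        return None
    return mean(a) - mean(b)

# Hypothetical toy data: two populations with one covariate and an outcome.
rows = [
    {"pop": "A", "age": 70, "y": 1.0},
    {"pop": "A", "age": 72, "y": 1.0},
    {"pop": "A", "age": 30, "y": 0.0},
    {"pop": "A", "age": 35, "y": 0.0},
    {"pop": "B", "age": 71, "y": 0.0},
    {"pop": "B", "age": 68, "y": 0.0},
    {"pop": "B", "age": 32, "y": 0.0},
    {"pop": "B", "age": 33, "y": 0.0},
]

# Gap over everyone versus gap inside one hand-picked candidate subgroup.
overall_gap = outcome_gap(rows, lambda r: True)
older_gap = outcome_gap(rows, lambda r: r["age"] >= 65)
younger_gap = outcome_gap(rows, lambda r: r["age"] < 65)

print(overall_gap, older_gap, younger_gap)
```

In this toy data the population-level gap is concentrated entirely among older individuals, which is the kind of region a differential subgroup is meant to surface. Here the candidate rule is chosen by hand; a discovery method would instead have to search over such rules.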