Mixture of Conditional Gaussian Graphical Models for unlabelled heterogeneous populations in the presence of co-factors
This work addresses a specific issue in statistical modeling that arises when researchers analyze complex datasets with confounding variables. The contribution is incremental: it builds on existing mixture models by incorporating conditional adjustments.
The paper tackles the problem of identifying sub-populations in unlabelled heterogeneous data when external co-features scatter the data points and disrupt traditional mixture models. It proposes a Mixture of Conditional Gaussian Graphical Models that subtracts the heterogeneous co-feature effects, and it demonstrates on synthetic and real data that the method identifies sub-populations where previous methods fail.
Conditional correlation networks, within Gaussian Graphical Models (GGM), are widely used to describe the direct interactions between the components of a random vector. In the case of an unlabelled heterogeneous population, Expectation Maximisation (EM) algorithms for Mixtures of GGM have been proposed to estimate both each sub-population's graph and the class labels. However, we argue that, with most real data, class affiliation cannot be described with a Mixture of Gaussians, which mostly groups data points according to their geometrical proximity. In particular, there often exist external co-features whose values affect the features' average value, scattering data points that belong to the same sub-population across the feature space. Additionally, if the co-features' effect on the features is heterogeneous, then the estimation of this effect cannot be separated from the sub-population identification. In this article, we propose a Mixture of Conditional GGM (CGGM) that subtracts the heterogeneous effects of the co-features to regroup the data points into clusters that correspond to the sub-populations. We develop a penalised EM algorithm to estimate graph-sparse model parameters. We demonstrate on synthetic and real data how this method fulfils its goal and succeeds in identifying the sub-populations where the Mixtures of GGM are disrupted by the effect of the co-features.
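To make the idea of subtracting class-specific co-feature effects concrete, the following is a minimal sketch of the class-probability (E) step for a mixture of conditional Gaussians. It is not the authors' implementation. The parameterisation (one coefficient matrix and one precision matrix per class), the function names `log_gauss` and `e_step`, and the simulated data are all assumptions made for illustration, and the sketch omits the sparsity penalty and the M-step entirely.

```python
import numpy as np

def log_gauss(resid, precision):
    """Log-density of N(0, precision^{-1}) at each row of resid."""
    p = resid.shape[1]
    _, logdet = np.linalg.slogdet(precision)
    quad = np.einsum("ij,jk,ik->i", resid, precision, resid)
    return 0.5 * (logdet - quad - p * np.log(2 * np.pi))

def e_step(Y, X, weights, coefs, precisions):
    """Posterior class probabilities given features Y and co-features X.

    Assumed model (illustrative): in class k, Y | X ~ N(X @ B_k, inv(Lambda_k)).
    Y: (n, p), X: (n, q), weights: (K,), coefs: K arrays of shape (q, p),
    precisions: K arrays of shape (p, p).
    """
    log_post = np.column_stack([
        # Subtract the class-specific co-feature effect before scoring.
        np.log(w) + log_gauss(Y - X @ B, Lam)
        for w, B, Lam in zip(weights, coefs, precisions)
    ])
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Simulated data (hypothetical): two classes whose co-feature effects have
# opposite signs, so each class is scattered over both sides of feature space.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
X = rng.choice([-1.0, 1.0], size=(n, 1)) * rng.uniform(1.0, 3.0, size=(n, 1))
coefs = [np.array([[2.0, 2.0]]), np.array([[-2.0, -2.0]])]
precisions = [np.eye(2), np.eye(2)]
means = np.where(labels[:, None] == 0, X @ coefs[0], X @ coefs[1])
Y = means + rng.standard_normal((n, 2))

# Responsibilities computed with the true parameters, for illustration only.
post = e_step(Y, X, np.array([0.5, 0.5]), coefs, precisions)
accuracy = np.mean(post.argmax(axis=1) == labels)
```

In this toy setting the two classes overlap almost completely in the raw feature space, because the sign of the co-feature decides on which side a point lands. Once each class's own co-feature effect is subtracted, the residuals separate the classes. A full EM procedure would alternate this step with a penalised update of the coefficients and precision matrices.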