ML LG Apr 11, 2022

On unsupervised projections and second order signals

arXiv:2204.05139v13.81 citations h-index: 25

Originality: Incremental advance

AI Analysis

The paper addresses a problem relevant to biomedicine and other fields where group differences in covariance or graphical model structure matter. The advance is incremental: it extends prior work on mean differences to second-order signals.

The paper investigates whether linear projections like PCA and random projections can preserve differences in second-order structure (e.g., covariance) between latent groups in unsupervised settings, finding that PCA is more effective than random projections and often competitive with supervised methods across diverse data regimes.

Linear projections are widely used in the analysis of high-dimensional data. In unsupervised settings where the data harbour latent classes/clusters, the question of whether class discriminatory signals are retained under projection is crucial. In the case of mean differences between classes, this question has been well studied. However, in many contemporary applications, notably in biomedicine, group differences at the level of covariance or graphical model structure are important. Motivated by such applications, in this paper we ask whether linear projections can preserve differences in second order structure between latent groups. We focus on unsupervised projections, which can be computed without knowledge of class labels. We discuss a simple theoretical framework for studying the behaviour of such projections, which we use to inform an analysis via quasi-exhaustive enumeration. This allows us to consider the performance, over more than a hundred thousand sets of data-generating population parameters, of two popular projections, namely random projections (RP) and Principal Component Analysis (PCA). Across this broad range of regimes, PCA turns out to be more effective at retaining second order signals than RP and is often even competitive with supervised projection. We complement these results with fully empirical experiments showing 0-1 loss using simulated and real data. We also study the effect of projection dimension, drawing attention to a bias-variance trade-off in this respect. Our results show that PCA can indeed be a suitable first step for unsupervised analysis, including in cases where differential covariance or graphical model structure is of interest.
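To make the question in the abstract concrete, the following is a minimal toy sketch in Python, not the paper's framework or its enumeration procedure. It assumes two latent groups with equal means and covariances that differ along a single direction, and it measures the retained second-order signal as the Frobenius norm of the projected covariance difference. That metric, the parameter values, and the helper names (`retained_signal`, `pca_projection`, `random_projection`) are all illustrative assumptions. The chosen configuration puts the covariance difference along a high-variance direction, which favours PCA; other regimes may behave differently.

```python
import numpy as np


def retained_signal(W, sigma_a, sigma_b):
    """Frobenius norm of the projected covariance difference W^T (Sigma_a - Sigma_b) W.

    Assumed metric for illustration; the paper may quantify signal differently.
    """
    return np.linalg.norm(W.T @ (sigma_a - sigma_b) @ W, ord="fro")


def pca_projection(X, k):
    """Top-k principal directions of the pooled data (class labels are never used)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T


def random_projection(p, k, rng):
    """Gaussian random projection, orthonormalised via QR so both maps are comparable."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
    return Q


rng = np.random.default_rng(0)
p, k, n = 50, 5, 1000  # ambient dimension, projection dimension, samples per group

# Toy population parameters (assumed): equal means, covariances differ along u.
u = np.zeros(p)
u[0] = 1.0
sigma_a = np.eye(p)
sigma_b = np.eye(p) + 8.0 * np.outer(u, u)

# Pooled, unlabelled sample from the two latent groups.
X = np.vstack([
    rng.multivariate_normal(np.zeros(p), sigma_a, size=n),
    rng.multivariate_normal(np.zeros(p), sigma_b, size=n),
])

W_pca = pca_projection(X, k)
W_rp = random_projection(p, k, rng)

full_signal = np.linalg.norm(sigma_a - sigma_b, ord="fro")
pca_signal = retained_signal(W_pca, sigma_a, sigma_b)
rp_signal = retained_signal(W_rp, sigma_a, sigma_b)

print(f"full signal: {full_signal:.3f}")
print(f"PCA retains: {pca_signal:.3f}")
print(f"RP retains:  {rp_signal:.3f}")
```

Because both projection matrices have orthonormal columns, neither can retain more than the full-dimensional signal, so the two printed values can be read as fractions of the same ceiling. Varying `k` in this sketch is one simple way to explore the effect of projection dimension mentioned in the abstract.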
