PCA of probability measures: Sparse and Dense sampling regimes
This work addresses a gap in the literature on PCA for multiple probability measures, a setting relevant to statistical learning and data analysis with grouped or repeated observations.
The paper studies PCA on multiple probability measures, each observed only through samples. It derives convergence rates for the empirical covariance operator and the PCA excess risk as functions of the number of measures and the number of samples per measure. These rates reveal a sparse-to-dense transition, and the dense-regime rate is proved minimax optimal for the empirical covariance error.
A common approach to performing PCA on probability measures is to embed them into a Hilbert space where standard functional PCA techniques apply. While convergence rates for estimating the embedding of a single measure from $m$ samples are well understood, the literature has not addressed the setting involving multiple measures. In this paper, we study PCA in a double asymptotic regime where $n$ probability measures are observed, each through $m$ samples. We derive convergence rates of the form $n^{-1/2} + m^{-\alpha}$ for the empirical covariance operator and the PCA excess risk, where $\alpha > 0$ depends on the chosen embedding. This characterizes the relationship between the number $n$ of measures and the number $m$ of samples per measure, revealing a sparse (small $m$) to dense (large $m$) transition in the convergence behavior. Moreover, we prove that the dense-regime rate is minimax optimal for the empirical covariance error. Our numerical experiments validate these theoretical rates and demonstrate that appropriate subsampling preserves PCA accuracy while reducing computational cost.
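The embed-then-PCA pipeline described above can be illustrated with a minimal sketch. It assumes one particular embedding, the kernel mean embedding with a Gaussian kernel, which the abstract does not specify; the function names, the bandwidth, and the synthetic data are all hypothetical choices made for illustration. Inner products between empirical embeddings are averaged kernel evaluations, so PCA reduces to an eigendecomposition of a centered $n \times n$ Gram matrix.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # Pairwise Gaussian kernel between rows of x (a, d) and y (b, d).
    sq = (np.sum(x**2, axis=1)[:, None]
          + np.sum(y**2, axis=1)[None, :]
          - 2.0 * x @ y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def embedding_gram(samples, bandwidth=1.0):
    # Inner products between empirical kernel mean embeddings:
    # <mu_i, mu_j> is the mean of k(x, y) over samples x of measure i
    # and samples y of measure j.
    n = len(samples)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            value = gaussian_kernel(samples[i], samples[j], bandwidth).mean()
            gram[i, j] = gram[j, i] = value
    return gram

def pca_of_measures(samples, n_components=2, bandwidth=1.0):
    # PCA of the embedded measures via the centered Gram matrix.
    gram = embedding_gram(samples, bandwidth)
    n = gram.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    centered = centering @ gram @ centering
    eigvals, eigvecs = np.linalg.eigh(centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.maximum(eigvals[order], 0.0)
    # Scores: coordinates of each centered embedding on the principal directions.
    scores = eigvecs[:, order] * np.sqrt(top_vals)
    # Nonzero eigenvalues of the empirical covariance operator are top_vals / n.
    return top_vals / n, scores

# Synthetic example (assumed data): n Gaussian measures with random means,
# each observed through m samples in dimension 2.
rng = np.random.default_rng(0)
n, m = 30, 50
means = rng.normal(size=(n, 2))
samples = [mu + rng.normal(size=(m, 2)) for mu in means]
cov_eigvals, scores = pca_of_measures(samples, n_components=2)
```

Building the Gram matrix costs on the order of $n^2 m^2$ kernel evaluations in this sketch, which is why using fewer samples per measure can reduce the computational cost substantially.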