Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

arXiv:2605.17778 · 54.9
Predicted impact: top 19% in ST · last 90 days
Originality: Incremental advance
AI Analysis

Provides theoretical foundations for self-distillation in high-dimensional statistics, offering optimality guarantees for practitioners using spectral shrinkage methods.

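This paper shows that, in spiked covariance models with $s$ spikes, $s$-step self-distillation is optimal among spectral shrinkage estimators and outperforms well-known estimators in statistics and machine learning. All $s$ steps are necessary: any estimator distilled for fewer than $s$ steps is strictly suboptimal. For isotropic covariances, optimally tuned Ridge regression performs best within the same class. The paper also treats a federated setting in which a central server aggregates estimators from multiple data centers; the best local rule is again a form of self-distillation, though it differs from the optimal rule for centrally hosted data.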

Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.
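The abstract does not spell out the estimator, so the following is only an illustrative sketch under a stated assumption: that each self-distillation step refits Ridge regression on a mixture of the original labels and the previous model's predictions, a formulation common in the self-distillation literature. The function names (`ridge`, `self_distill`, `spectral_form`), the mixing weights `xis`, and the penalty `lam` are hypothetical and not taken from the paper. Under that assumption, the sketch shows why a distilled estimator is a spectral shrinkage estimator: every step only rescales the coefficients along the eigenvectors of $X^\top X$.

```python
import numpy as np


def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def self_distill(X, y, lam, xis):
    """Assumed form of multi-step self-distillation.

    Step i refits ridge on a mix of the teacher's predictions
    (weight xis[i]) and the original labels (weight 1 - xis[i]).
    """
    beta = ridge(X, y, lam)  # step 0: plain ridge teacher
    for xi in xis:
        target = xi * (X @ beta) + (1.0 - xi) * y
        beta = ridge(X, target, lam)
    return beta


def spectral_form(X, y, lam, xis):
    """Same estimator, written as shrinkage on the spectrum of X^T X."""
    evals, V = np.linalg.eigh(X.T @ X)
    r = evals / (evals + lam)      # ridge shrinkage ratio per eigenvalue
    c = np.ones_like(evals)        # c = 1 corresponds to plain ridge
    for xi in xis:
        c = xi * r * c + (1.0 - xi)  # effect of one distillation step
    return V @ ((c / (evals + lam)) * (V.T @ (X.T @ y)))


# Toy data from a spiked covariance: identity plus two rank-one spikes.
rng = np.random.default_rng(0)
n, d = 200, 10
Sigma = np.eye(d)
Sigma[0, 0] += 5.0  # first spike (hypothetical strength)
Sigma[1, 1] += 2.0  # second spike (hypothetical strength)
X = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

direct = self_distill(X, y, lam=1.0, xis=[0.4, -0.8])
spectral = spectral_form(X, y, lam=1.0, xis=[0.4, -0.8])
print(np.allclose(direct, spectral))
```

Each additional step adds one free parameter to the shrinkage function, which is at least consistent with the abstract's claim that the number of steps needed matches the number of spikes; the sketch does not attempt to reproduce the paper's optimal tuning.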

