ML · DIS-NN · LG · Jan 27, 2025

The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model

arXiv:2501.16226v44 citations · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

This work clarifies why self-distillation helps in noisy classification tasks, offering incremental theoretical and practical insights for practitioners.

The study analyzes self-distillation for noisy binary classification on Gaussian mixture data. It finds that denoising via hard pseudo-labels drives the performance gains, with improvements of up to 10% on moderately sized datasets, and it proposes two heuristics: early stopping and bias parameter fixing.

Self-distillation (SD), a technique where a model improves itself using its own predictions, has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear. In this study, we investigate the efficacy of hyperparameter-tuned multi-stage SD with a linear classifier for binary classification on noisy Gaussian mixture data. For the analysis, we employ the replica method from statistical physics. Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, with the most notable gains observed in moderately sized datasets. We also identify two practical heuristics to enhance SD: early stopping that limits the number of stages, which is broadly effective, and bias parameter fixing, which helps under label imbalance. To empirically validate the theoretical findings derived from our toy model, we conduct additional experiments on CIFAR-10 classification using a pretrained ResNet backbone. These results provide both theoretical and practical insights, advancing our understanding and application of SD in noisy settings.
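The multi-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data generator, the cluster separation `sep`, the flip probability, and the use of closed-form ridge least squares as the linear classifier are all assumptions chosen for brevity (the abstract does not specify the loss), and every function name is hypothetical. Each stage refits the classifier on hard (sign) pseudo-labels produced by the previous stage.

```python
import numpy as np

def make_noisy_gmm(n, d, flip_prob, rng, sep=2.0):
    """Two-cluster Gaussian mixture with symmetric label flips (assumed setup)."""
    mu = np.ones(d) * sep / np.sqrt(d)            # cluster mean with norm `sep`
    y_true = rng.choice([-1.0, 1.0], size=n)      # balanced clean labels
    X = y_true[:, None] * mu + rng.standard_normal((n, d))
    flips = rng.random(n) < flip_prob             # which labels get corrupted
    y_noisy = np.where(flips, -y_true, y_true)
    return X, y_true, y_noisy

def add_bias(X):
    """Append a constant column so the last weight acts as the bias."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge(X, y, lam):
    """Closed-form ridge least squares; the bias term is not regularised."""
    Xb = add_bias(X)
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def predict(w, X):
    """Hard labels in {-1, +1} from the sign of the linear score."""
    return np.where(add_bias(X) @ w >= 0, 1.0, -1.0)

def accuracy(w, X, y):
    return float(np.mean(predict(w, X) == y))

def self_distill(X, y_noisy, stages, lam):
    """Stage 0 fits the noisy labels; each later stage fits the previous
    stage's hard pseudo-labels. Returns one weight vector per stage."""
    weights, targets = [], y_noisy
    for _ in range(stages + 1):
        w = fit_ridge(X, targets, lam)
        weights.append(w)
        targets = predict(w, X)                   # hard pseudo-labels
    return weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, _, y_noisy = make_noisy_gmm(n=400, d=50, flip_prob=0.3, rng=rng)
    X_test, y_test, _ = make_noisy_gmm(n=5000, d=50, flip_prob=0.0, rng=rng)
    for t, w in enumerate(self_distill(X, y_noisy, stages=3, lam=1.0)):
        print(f"stage {t}: clean test accuracy = {accuracy(w, X_test, y_test):.3f}")
```

Comparing the per-stage accuracies on clean test labels is one way to see whether additional stages still help. The early-stopping heuristic mentioned in the abstract corresponds to capping `stages`; tuning `lam` separately at each stage would be closer to the hyperparameter-tuned setting the paper studies.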

Code Implementations: 1 repo