Sample Complexity of Nonparametric Semi-Supervised Learning
This work addresses the theoretical challenge of sample efficiency in semi-supervised learning for researchers and practitioners, though it is incremental by extending binary classification results to multiclass settings.
The paper tackles the sample complexity of nonparametric semi-supervised learning for multiclass classification, establishing an Ω(K log K) labeled sample complexity bound and showing that near-optimal classifiers can be learned with few labeled samples.
We study the sample complexity of semi-supervised learning (SSL) and introduce new assumptions based on the mismatch between a mixture model learned from unlabeled data and the true mixture model induced by the (unknown) class conditional distributions. Under these assumptions, we establish an $Ω(K\log K)$ labeled sample complexity bound without imposing parametric assumptions, where $K$ is the number of classes. Our results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples. Unlike previous theoretical work which focuses on binary classification, we consider general multiclass classification ($K>2$), which requires solving a difficult permutation learning problem. This permutation defines a classifier whose classification error is controlled by the Wasserstein distance between mixing measures, and we provide finite-sample results characterizing the behaviour of the excess risk of this classifier. Finally, we describe three algorithms for computing these estimators based on a connection to bipartite graph matching, and perform experiments to illustrate the superiority of the MLE over the majority vote estimator.