LG OC MLJun 25, 2021

Self-training Converts Weak Learners to Strong Learners in Mixture Models

Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu

arXiv:2106.13805v314.624 citations

Originality Incremental advance

AI Analysis

This provides a theoretical foundation for semi-supervised learning in mixture models, demonstrating efficient use of unlabeled data to reduce labeled data requirements, though it is incremental as it builds on existing self-training and mixture model frameworks.

The paper tackles the problem of binary classification in mixture models by showing that self-training can convert a weak learner to a strong learner using unlabeled data, achieving Bayes-optimal classifier accuracy up to ε error with O(d) labeled examples and Õ(d/ε²) unlabeled examples.

We consider a binary classification problem when the data comes from a mixture of two rotationally symmetric distributions satisfying concentration and anti-concentration properties enjoyed by log-concave distributions among others. We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbolβ_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbolβ_0 := \boldsymbolβ_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbolβ_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension. That is, self-training converts weak learners to strong learners using only unlabeled examples. We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbolβ_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples (i.e., independent of $\varepsilon$). Together our results imply that mixture models can be learned to within $\varepsilon$ of the Bayes-optimal accuracy using at most $O(d)$ labeled examples and $\tilde O(d/\varepsilon^2)$ unlabeled examples by way of a semi-supervised self-training algorithm.

View on arXiv PDF

Similar