LG · Dec 1, 2022

Margin-based sampling in high dimensions: When being active is less efficient than staying passive

Alexandru Tifrea, Jacob Clarysse, Fanny Yang

arXiv:2212.00772v27.85 citations · h-index: 17

Originality: Incremental advance

AI Analysis

This paper challenges a widely held belief in machine learning by showing that active learning can be less efficient than passive learning in high dimensions, a result that matters for practitioners who rely on active learning to reduce labeling costs.

The paper studies why margin-based active learning can underperform passive learning in high-dimensional settings. It proves for logistic regression that passive learning outperforms margin-based active learning even with noiseless data and when the Bayes optimal decision boundary is used for sampling, and it corroborates this with experiments on 20 diverse high-dimensional datasets.

It is widely believed that given the same labeling budget, active learning (AL) algorithms like margin-based active learning achieve better predictive performance than passive learning (PL), albeit at a higher computational cost. Recent empirical evidence suggests that this added cost might be in vain, as margin-based AL can sometimes perform even worse than PL. While existing works offer different explanations in the low-dimensional regime, this paper shows that the underlying mechanism is entirely different in high dimensions: we prove for logistic regression that PL outperforms margin-based AL even for noiseless data and when using the Bayes optimal decision boundary for sampling. Insights from our proof indicate that this high-dimensional phenomenon is exacerbated when the separation between the classes is small. We corroborate this intuition with experiments on 20 high-dimensional datasets spanning a diverse range of applications, from finance and histology to chemistry and computer vision.
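
The two query strategies compared in the abstract can be illustrated with a minimal sketch. It assumes a linear classifier through the origin and a synthetic Gaussian pool; the function names `margin_query` and `passive_query`, the dimensions, and the direction `w_star` are hypothetical choices for illustration, not the authors' experimental setup. Margin-based AL requests labels for the pool points closest to the current decision boundary, while PL requests labels for points drawn uniformly at random.

```python
import numpy as np


def margin_query(w, X_pool, k):
    """Return indices of the k pool points closest to the hyperplane {x : <w, x> = 0}."""
    margins = np.abs(X_pool @ w) / np.linalg.norm(w)
    return np.argsort(margins)[:k]


def passive_query(rng, n_pool, k):
    """Return k pool indices drawn uniformly at random without replacement."""
    return rng.choice(n_pool, size=k, replace=False)


rng = np.random.default_rng(0)
d, n_pool, k = 50, 1000, 10           # assumed sizes, chosen only for illustration

w_star = np.zeros(d)
w_star[0] = 1.0                       # hypothetical decision-boundary normal used for sampling
X_pool = rng.normal(size=(n_pool, d))  # synthetic unlabeled pool

al_idx = margin_query(w_star, X_pool, k)
pl_idx = passive_query(rng, n_pool, k)

# Average distance to the boundary of the points each strategy selects.
al_margin = np.abs(X_pool[al_idx] @ w_star).mean()
pl_margin = np.abs(X_pool[pl_idx] @ w_star).mean()
print(f"mean margin of AL queries: {al_margin:.4f}")
print(f"mean margin of PL queries: {pl_margin:.4f}")
```

By construction, the points queried by margin-based sampling lie at least as close to the boundary on average as the uniformly sampled ones. The sketch only shows this selection step; it does not reproduce the paper's result on how the selected points affect the fitted logistic regression model.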
