LG | ML | Jun 15, 2018

On the Relationship between Data Efficiency and Error for Uncertainty Sampling

arXiv:1806.06123v1 | 37 citations
Originality: Incremental advance
AI Analysis

This paper addresses the mixed data efficiency that active learning shows in practice, giving researchers and practitioners insight into when it is effective. The advance is incremental because the analysis is limited to one algorithm (uncertainty sampling) and one model (logistic regression).

The paper investigates when active learning is helpful, finding empirically on 21 OpenML datasets that data efficiency strongly inversely correlates with error rate, and theoretically showing asymptotic data efficiency is within a constant factor of the inverse error rate for a variant of uncertainty sampling.
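The theoretical claim can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: n_passive(ε) and n_active(ε) denote the number of labels that random sampling and uncertainty sampling need to reach error ε, Err is the error rate of the limiting classifier, and c_1, c_2 are unspecified positive constants.

```latex
% Data efficiency: ratio of labels needed to reach the same error (assumed notation)
\mathrm{DE}(\epsilon) = \frac{n_{\text{passive}}(\epsilon)}{n_{\text{active}}(\epsilon)}

% "Within a constant factor of the inverse error rate", asymptotically
\frac{c_1}{\mathrm{Err}} \;\le\; \mathrm{DE} \;\le\; \frac{c_2}{\mathrm{Err}}
```

Read this way, a limiting error rate of 5% would suggest a data efficiency on the order of 20, up to the unspecified constants.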

While active learning offers potential cost savings, the actual data efficiency (the reduction in the amount of labeled data needed to obtain the same error rate) observed in practice is mixed. This paper poses a basic question: when is active learning actually helpful? We provide an answer for logistic regression with the popular active learning algorithm, uncertainty sampling. Empirically, on 21 datasets from OpenML, we find a strong inverse correlation between data efficiency and the error rate of the final classifier. Theoretically, we show that for a variant of uncertainty sampling, the asymptotic data efficiency is within a constant factor of the inverse error rate of the limiting classifier.
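To make the setup concrete, here is a minimal sketch of pool-based uncertainty sampling with logistic regression, using only the Python standard library. It is not the paper's code, and it does not reproduce the specific variant analyzed in the theory. The synthetic two-blob data, the gradient-descent fitter, and all function names and hyperparameters (`fit_logreg`, `uncertainty_sampling`, `data_efficiency`, `n_seed`, and so on) are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # w holds one weight per feature plus a trailing bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def fit_logreg(X, y, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    d = len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            residual = predict_proba(w, xi) - yi
            for j in range(d):
                grad[j] += residual * xi[j]
            grad[d] += residual
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def error_rate(w, X, y):
    wrong = sum((predict_proba(w, x) >= 0.5) != (yi == 1) for x, yi in zip(X, y))
    return wrong / len(X)

def uncertainty_sampling(X_pool, y_pool, n_seed, n_labels, rng):
    """Label n_seed random points, then repeatedly query the unlabeled
    pool point whose predicted probability is closest to 0.5."""
    idx = list(range(len(X_pool)))
    rng.shuffle(idx)
    labeled, unlabeled = idx[:n_seed], idx[n_seed:]
    while len(labeled) < n_labels:
        w = fit_logreg([X_pool[i] for i in labeled], [y_pool[i] for i in labeled])
        query = min(unlabeled, key=lambda i: abs(predict_proba(w, X_pool[i]) - 0.5))
        unlabeled.remove(query)
        labeled.append(query)
    return labeled

def data_efficiency(n_passive, n_active):
    # Ratio of labels needed by passive vs. active learning for the same error.
    return n_passive / n_active

def make_data(n, rng):
    # Hypothetical synthetic data: two overlapping Gaussian blobs in 2D.
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_pool, y_pool = make_data(200, rng)
X_test, y_test = make_data(500, rng)

active_idx = uncertainty_sampling(X_pool, y_pool, n_seed=5, n_labels=30, rng=rng)
passive_idx = rng.sample(range(len(X_pool)), 30)

w_active = fit_logreg([X_pool[i] for i in active_idx], [y_pool[i] for i in active_idx])
w_passive = fit_logreg([X_pool[i] for i in passive_idx], [y_pool[i] for i in passive_idx])

print("active error: ", error_rate(w_active, X_test, y_test))
print("passive error:", error_rate(w_passive, X_test, y_test))
```

Estimating data efficiency in earnest would require full learning curves for both strategies, averaged over many random seeds, and then comparing the label counts at which each reaches a target error via `data_efficiency`. A single run like this one only illustrates the mechanics.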

Code Implementations: 1 repo
