LGMLJun 26, 2019

Near Optimal Stratified Sampling

arXiv:1906.11289v23 citations
AI Analysis

This work addresses the practical challenge of reducing labeling costs for ML evaluation, though it is incremental as it builds on existing stratified sampling methods by removing unrealistic assumptions.

The paper tackles the problem of expensive ground truth labeling for machine learning evaluation by proposing two new stratified sampling algorithms that simultaneously estimate statistical properties across strata and optimize evaluation accuracy, achieving near-optimal label complexity reduction as shown through theoretical lower bounds and experiments on synthetic and real data.

The performance of a machine learning system is usually evaluated by using i.i.d.\ observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound to show the proposed algorithms (to log-factors) are rate optimal. Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes