Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
This work addresses a practical issue for researchers and practitioners in machine learning by optimizing label allocation for classifier comparison, though it is incremental as it builds on existing statistical theory.
The paper tackles the problem of efficiently comparing two binary classifiers under a fixed labeling budget, proving that collecting single labels for more samples is more effective than aggregating multiple noisy labels per sample via majority vote. The result provides superior sample size bounds compared to Hoeffding's bound and has implications for machine learning benchmark design.
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.