Semi-verified PAC Learning from the Crowd
This addresses the challenge of robust learning from unreliable crowdsourced data for applications like online platforms, though it is incremental as it builds on prior semi-verified models.
The paper tackles the problem of PAC learning threshold functions from crowdsourced labels where most workers may be adversarial, using a semi-verified model with limited access to a trusted oracle. It shows that learning is possible with manageable label queries and that labeling costs can be reduced via comparison queries, without relying on data distributional assumptions.
We study the problem of crowdsourced PAC learning of threshold functions. This is a challenging problem and only recently have query-efficient algorithms been established under the assumption that a noticeable fraction of the workers are perfect. In this work, we investigate a more challenging case where the majority may behave adversarially and the rest behave as the Massart noise - a significant generalization of the perfectness assumption. We show that under the {semi-verified model} of Charikar et al. (2017), where we have (limited) access to a trusted oracle who always returns correct annotations, it is possible to PAC learn the underlying hypothesis class with a manageable amount of label queries. Moreover, we show that the labeling cost can be drastically mitigated via the more easily obtained comparison queries. Orthogonal to recent developments in semi-verified or list-decodable learning that crucially rely on data distributional assumptions, our PAC guarantee holds by exploring the wisdom of the crowd.