LG AI IT ML
Jun 30, 2016

A Permutation-based Model for Crowd Labeling: Optimal Estimation and Robustness

Nihar B. Shah, Sivaraman Balakrishnan, Martin J. Wainwright

arXiv:1606.09632v3
13.147 citations

Originality: Incremental advance

AI Analysis

This work addresses data quality on crowdsourcing platforms, which matters for applications that rely on large-scale human annotation. It appears incremental because it builds on existing models.

The authors address the problem of aggregating noisy crowd-labeled data. They propose a permutation-based model that generalizes the classical Dawid-Skene model, derive global minimax rates that are sharp up to logarithmic factors, and design efficient estimators with non-asymptotic performance bounds, validated through synthetic and real-world experiments.

The task of aggregating and denoising crowd-labeled data has gained increased significance with the advent of crowdsourcing platforms and massive datasets. We propose a permutation-based model for crowd-labeled data that is a significant generalization of the classical Dawid-Skene model, and introduce a new error metric by which to compare different estimators. We derive global minimax rates for the permutation-based model that are sharp up to logarithmic factors, and match the minimax lower bounds derived under the simpler Dawid-Skene model. We then design two computationally efficient estimators: the WAN estimator for the setting where the ordering of workers in terms of their abilities is approximately known, and the OBI-WAN estimator for the setting where it is not known. For each of these estimators, we provide non-asymptotic bounds on its performance. We conduct synthetic simulations and experiments on real-world crowdsourcing data, and the experimental results corroborate our theoretical findings.
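
To make the aggregation setting concrete, here is a minimal sketch of the simpler baseline the abstract refers to: a one-coin Dawid-Skene-style simulation in which each worker has a single accuracy, combined with plain majority voting. It does not implement the permutation-based model, the WAN or OBI-WAN estimators, or the paper's error metric. The function names (`simulate_one_coin`, `majority_vote`, `hamming_error`), the binary labels in {-1, +1}, and the use of Hamming error are all assumptions made for illustration.

```python
import random


def simulate_one_coin(abilities, n_questions, seed=0):
    """Simulate binary crowd labels (illustrative assumption, not the paper's code).

    Worker i answers each question correctly with probability abilities[i],
    independently across questions. True labels are drawn from {-1, +1}.
    Returns (truth, labels) where labels[i][j] is worker i's answer to question j.
    """
    rng = random.Random(seed)
    truth = [rng.choice([-1, 1]) for _ in range(n_questions)]
    labels = [
        [t if rng.random() < ability else -t for t in truth]
        for ability in abilities
    ]
    return truth, labels


def majority_vote(labels):
    """Estimate each question's label as the sign of the summed worker answers."""
    n_questions = len(labels[0])
    estimates = []
    for j in range(n_questions):
        total = sum(row[j] for row in labels)
        # Ties (total == 0) are broken toward +1.
        estimates.append(1 if total >= 0 else -1)
    return estimates


def hamming_error(truth, estimates):
    """Fraction of questions whose estimated label differs from the truth."""
    return sum(t != e for t, e in zip(truth, estimates)) / len(truth)


# Example run with five hypothetical workers of varying accuracy.
truth, labels = simulate_one_coin([0.9, 0.8, 0.7, 0.6, 0.6], n_questions=200)
print(hamming_error(truth, majority_vote(labels)))
```

Majority voting weights every worker equally, which is why it serves only as a reference point here: estimators that account for differing worker abilities, such as those the abstract describes, aim to do better when abilities vary.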
