Max-MIG: an Information Theoretic Approach for Joint Learning from Crowds
Max-MIG addresses a key challenge in crowdsourcing for machine learning: aggregating labels reliably without assuming a particular information structure among the crowd workers. The contribution is incremental but practical for real-world data annotation tasks.
The paper tackles the problem of learning from crowdsourced labels without prior knowledge of the information structure among crowd workers, addressing issues like correlated mistakes and random labeling. It proposes Max-MIG, an information-theoretic approach that simultaneously aggregates labels and learns a classifier, achieving state-of-the-art results in most settings and demonstrating robustness to various information structures.
Eliciting labels from crowds is a potential way to obtain large labeled datasets. Despite the variety of methods developed for learning from crowds, a key challenge remains unsolved: \emph{learning from crowds without knowing the information structure among the crowds a priori, when some crowd workers make highly correlated mistakes and some label effortlessly (e.g. randomly)}. We propose an information-theoretic approach, Max-MIG, for joint learning from crowds, under a common assumption: the crowdsourced labels and the data are independent conditioned on the ground truth. Max-MIG simultaneously aggregates the crowdsourced labels and learns an accurate data classifier. Furthermore, we devise an accurate data-crowds forecaster that employs both the data and the crowdsourced labels to forecast the ground truth. To the best of our knowledge, this is the first algorithm that solves the aforementioned challenge of learning from crowds. In addition to the theoretical validation, we empirically show that our algorithm achieves new state-of-the-art results in most settings, including on real-world data, and is the first algorithm that is robust to various information structures. Code is available at \href{https://github.com/Newbeeer/Max-MIG}{https://github.com/Newbeeer/Max-MIG}.
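
As a sketch of what the conditional independence assumption provides, written in notation that is assumed here for illustration rather than taken from the abstract ($X$ for a data point, $Y$ for the ground truth with classes $c$, and $Y^{[M]}$ for the labels given by $M$ crowd workers), the assumption reads
\[
P(X, Y^{[M]} \mid Y) = P(X \mid Y)\, P(Y^{[M]} \mid Y).
\]
Applying Bayes' rule to each factor on the right, and assuming $P(Y = c) > 0$ for every class, gives
\[
P(Y = c \mid X, Y^{[M]}) \;\propto\; \frac{P(Y = c \mid X)\, P(Y = c \mid Y^{[M]})}{P(Y = c)},
\]
where the right-hand side is normalized over $c$. This is a standard consequence of the assumption and not necessarily the exact construction used by Max-MIG. It suggests how a forecaster could combine a data classifier's output with an aggregate of the crowdsourced labels.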