LG | Nov 26, 2024

Learning from Noisy Labels via Conditional Distributionally Robust Optimization

arXiv:2411.17113v111.56 citations | h-index: 3 | Has Code | NIPS

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of noisy annotations for practitioners who train models on crowdsourced data. The contribution appears incremental, as it builds on existing distributionally robust optimization frameworks.

The paper tackles learning from noisy labels in crowdsourced datasets with a conditional distributionally robust optimization (CDRO) approach that minimizes the worst-case risk within an ambiguity set. The result is a robust pseudo-labeling algorithm that the authors report outperforms existing methods on synthetic and real-world datasets.
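To make "worst-case risk within an ambiguity set" concrete, here is a minimal sketch for a single data point. It assumes a total variation ball around the reference label distribution, which is an illustrative choice and not necessarily the distance used in the paper; the function name `worst_case_risk_tv` and its parameters are hypothetical.

```python
import numpy as np

def worst_case_risk_tv(losses, ref_probs, radius):
    """Worst-case expected loss over label distributions q with
    TV(q, ref_probs) <= radius, where TV = 0.5 * sum(|q - ref_probs|).

    Assumption: a total variation ambiguity set, chosen only for illustration.
    """
    losses = np.asarray(losses, dtype=float)
    q = np.asarray(ref_probs, dtype=float).copy()

    # The adversary moves probability mass toward the highest-loss label.
    worst = int(np.argmax(losses))
    budget = min(radius, 1.0 - q[worst])
    q[worst] += budget

    # The mass is taken from the lowest-loss labels first.
    for k in np.argsort(losses):
        if budget <= 0:
            break
        if k == worst:
            continue
        take = min(q[k], budget)
        q[k] -= take
        budget -= take

    return float(q @ losses), q

# Per-label losses of a model on one instance, and an estimated posterior.
losses = [0.1, 2.0, 0.5]
ref_probs = [0.7, 0.1, 0.2]
nominal, _ = worst_case_risk_tv(losses, ref_probs, radius=0.0)
robust, q_star = worst_case_risk_tv(losses, ref_probs, radius=0.2)
print(nominal, robust, q_star)
```

With radius 0 the value is the ordinary expected loss under the reference distribution; a larger radius lets the adversary shift more mass toward the costliest label, so the robust risk can only grow.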

While crowdsourcing has emerged as a practical solution for labeling large datasets, it presents a significant challenge in learning accurate models due to noisy labels from annotators with varying levels of expertise. Existing methods typically estimate the true label posterior, conditioned on the instance and noisy annotations, to infer true labels or adjust loss functions. These estimates, however, often overlook potential misspecification in the true label posterior, which can degrade model performance, especially in high-noise scenarios. To address this issue, we investigate learning from noisy annotations with an estimated true label posterior through the framework of conditional distributionally robust optimization (CDRO). We propose formulating the problem as minimizing the worst-case risk within a distance-based ambiguity set centered around a reference distribution. By examining the strong duality of the formulation, we derive upper bounds for the worst-case risk and develop an analytical solution for the dual robust risk for each data point. This leads to a novel robust pseudo-labeling algorithm that leverages the likelihood ratio test to construct a pseudo-empirical distribution, providing a robust reference probability distribution in CDRO. Moreover, to devise an efficient algorithm for CDRO, we derive a closed-form expression for the empirical robust risk and the optimal Lagrange multiplier of the dual problem, facilitating a principled balance between robustness and model fitting. Our experimental results on both synthetic and real-world datasets demonstrate the superiority of our method.
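The abstract does not spell out the likelihood ratio test, so the following is only a rough sketch of the general idea: accept a pseudo-label when the estimated posterior favors one class over the runner-up by a large enough ratio. The function name `lr_pseudo_labels`, the default threshold, and the use of -1 for unresolved instances are assumptions made for illustration, not the paper's algorithm.

```python
import numpy as np

def lr_pseudo_labels(posteriors, threshold=2.0):
    """Assign a pseudo-label when the ratio between the largest and the
    second-largest estimated posterior probability reaches `threshold`.

    Assumption: a simple top-two ratio rule; instances that fail the
    test are marked -1 (left unresolved).
    """
    p = np.asarray(posteriors, dtype=float)
    order = np.argsort(p, axis=1)      # ascending order within each row
    top, second = order[:, -1], order[:, -2]
    rows = np.arange(p.shape[0])

    # Guard against division by zero when the runner-up has zero mass.
    ratio = p[rows, top] / np.maximum(p[rows, second], 1e-12)
    labels = np.where(ratio >= threshold, top, -1)
    return labels, ratio

# Two instances with estimated true-label posteriors over three classes.
posteriors = [[0.80, 0.15, 0.05],
              [0.40, 0.35, 0.25]]
labels, ratio = lr_pseudo_labels(posteriors, threshold=2.0)
print(labels, ratio)
```

A higher threshold accepts fewer but more confident pseudo-labels, which is one simple way to keep a reference distribution from being dominated by uncertain estimates.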
