ML, LG · Jun 12, 2025

Probably Approximately Correct Labels

Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic

arXiv:2506.10908

Originality: Highly original

AI Analysis

The work addresses the need for cost-effective, high-quality labeled datasets across domains, offering a rigorous alternative to fully manual annotation.

The paper tackles the cost of manual labeling by proposing a method that uses pre-trained AI models for dataset curation. The resulting labels are probably approximately correct: with high probability, the overall labeling error is small. The method is demonstrated on text annotation, image labeling, and protein folding analysis.

Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
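The phrase "probably approximately correct labels" can be read as a guarantee of the following form. This is a sketch under assumed notation, not a statement taken from the abstract: $n$ is the number of data points, $Y_i$ the true label, $\tilde{Y}_i$ the label assigned by the curation procedure, $\ell$ a loss function, $\epsilon$ the error tolerance, and $\alpha$ the allowed failure probability.

```latex
\[
  \mathbb{P}\!\left( \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, \tilde{Y}_i\bigr) \le \epsilon \right) \ge 1 - \alpha
\]
```

With the 0-1 loss $\ell(y, \tilde{y}) = \mathbf{1}\{y \neq \tilde{y}\}$, the left-hand average is the fraction of mislabeled points, so the guarantee says that at most an $\epsilon$ fraction of labels are wrong with probability at least $1 - \alpha$.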

