CL · AI · LG · Jun 25, 2024

The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

arXiv:2407.11004v2 · 11 citations
Originality: Highly original
AI Analysis

The paper addresses the expense and inflexibility of data annotation for machine learning practitioners, offering a cost-effective alternative to LLM-based annotation.

The paper tackles the high cost and static nature of using large pretrained models as data annotators. Instead of querying a model for each label, the proposed system has the model generate programs that produce the labels. It reports comparable or better performance than LLM-based annotation, with a 12.9% average improvement, while reducing total labeling costs by approximately 500x.

Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500x.
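The idea of storing generated labeling programs and applying them locally can be illustrated with a minimal sketch. This is not the Alchemist implementation: the three programs below are hand-written stand-ins for what a model might generate, the spam-detection task is hypothetical, and combining votes by simple majority is an assumption made for illustration (the abstract does not say how program outputs are aggregated).

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

# A labeling program maps an example to a class label, or to None to abstain.
LabelingProgram = Callable[[str], Optional[int]]


# Hypothetical programs of the kind a model could be asked to write
# for a spam (1) vs. not-spam (0) task. They run locally at no API cost.
def contains_free(text: str) -> Optional[int]:
    return 1 if "free" in text.lower() else None


def contains_prize(text: str) -> Optional[int]:
    return 1 if "prize" in text.lower() else None


def starts_with_greeting(text: str) -> Optional[int]:
    return 0 if text.lower().startswith(("hi", "hello")) else None


def majority_vote(votes: Sequence[Optional[int]]) -> Optional[int]:
    """Return the most common non-abstaining vote, or None if all abstain.

    Assumption: plain majority vote; ties go to the label seen first.
    """
    counted = Counter(v for v in votes if v is not None)
    if not counted:
        return None
    return counted.most_common(1)[0][0]


def label_dataset(
    texts: Sequence[str], programs: Sequence[LabelingProgram]
) -> List[Optional[int]]:
    """Apply every stored program to every example and aggregate the votes."""
    return [majority_vote([program(text) for program in programs]) for text in texts]


PROGRAMS: List[LabelingProgram] = [contains_free, contains_prize, starts_with_greeting]

if __name__ == "__main__":
    examples = [
        "Claim your FREE prize now",
        "Hello, are we still meeting at noon?",
        "Quarterly report attached",
    ]
    print(label_dataset(examples, PROGRAMS))  # [1, 0, None]
```

Because the programs are ordinary functions, they can be read, audited, edited, and re-run on new data without further model queries, which is the property the abstract contrasts with static, per-example model labels.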

Code Implementations: 1 repo
