CL, AI, LG | Jan 21, 2025

Automatic Labelling with Open-source LLMs using Dynamic Label Schema Integration

arXiv:2501.12332v1 | 8 citations | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses privacy and cost concerns in data labelling for machine learning projects, though it appears incremental because it builds on existing label schema integration approaches.

The paper tackles the problem of costly labeled data acquisition by proposing a method to use open-source LLMs for automatic labeling, achieving performance improvements through dynamic label schema integration.

Acquiring labelled training data that meets quantity and quality requirements remains a costly task in real-world machine learning projects. Recently, Large Language Models (LLMs), notably GPT-4, have shown great promise in labelling data with high accuracy. However, privacy and cost concerns prevent the ubiquitous use of GPT-4. In this work, we explore effectively leveraging open-source models for automatic labelling. We identify integrating the label schema as a promising technique, but find that naively using the label description for classification leads to poor performance on high-cardinality tasks. To address this, we propose Retrieval Augmented Classification (RAC), in which the LLM performs inference for one label at a time using the corresponding label schema; we start with the most related label and iterate until a label is chosen by the LLM. We show that our method, which dynamically integrates label descriptions, leads to performance improvements in labelling tasks. We further show that by focusing only on the most promising labels, RAC can trade off between label quality and coverage, a property we leverage to automatically label our internal datasets.
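
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how labels are ranked or how the LLM is prompted, so the word-overlap ranking (`rank_labels`), the `judge` callable standing in for a per-label LLM call, the `top_k` cutoff, and the toy `SCHEMA` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical label schema: label name -> natural-language description.
SCHEMA: Dict[str, str] = {
    "billing": "questions about invoice payment or refund",
    "shipping": "questions about delivery tracking or package",
    "account": "questions about login password or profile",
}

# A judge answers one yes/no question: does this text match this label?
# In practice this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, str, str], bool]


def tokenize(text: str) -> Set[str]:
    """Lower-case whitespace tokenization (illustrative only)."""
    return set(text.lower().split())


def rank_labels(text: str, schema: Dict[str, str]) -> List[str]:
    """Order labels from most to least related to the text.

    Assumption: Jaccard word overlap stands in for whatever retriever
    is actually used to find the most related labels.
    """
    words = tokenize(text)

    def score(label: str) -> float:
        desc = tokenize(schema[label])
        union = words | desc
        return len(words & desc) / len(union) if union else 0.0

    return sorted(schema, key=score, reverse=True)


def rac_label(text: str, schema: Dict[str, str], judge: Judge,
              top_k: int) -> Optional[str]:
    """Query the judge one label at a time, most related label first.

    Returns the first label the judge accepts, or None (abstain) if
    none of the top_k candidates is accepted.
    """
    for label in rank_labels(text, schema)[:top_k]:
        if judge(text, label, schema[label]):
            return label
    return None


def make_oracle_judge(gold_label: str) -> Judge:
    """Stub judge that accepts only the known correct label."""
    return lambda text, label, description: label == gold_label
```

With a stub judge that accepts only `"billing"`, the text `"refund for my package delivery"` is ranked closest to `"shipping"`, so `rac_label(..., top_k=1)` abstains while `top_k=2` reaches the correct label. A smaller `top_k` labels fewer items but only from the most promising candidates, which is one way to read the quality versus coverage trade-off the abstract mentions.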
