CL, Nov 7, 2024

Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

arXiv:2411.04637v3, 6 citations, h-index: 12
Originality: Synthesis-oriented
AI Analysis

This is an incremental tutorial for NLP practitioners in research and industry to optimize data labeling projects.

This tutorial addresses the cost and slowness of human data labeling for machine learning. It presents three strategies for speeding up annotation and reducing costs (synthetic data generation, active learning, and hybrid labeling) and emphasizes practical application through case studies and a hands-on workshop.

Training and deploying machine learning models rely on a large amount of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss real-life case studies in detail. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop, where attendees will be guided in implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.
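
The abstract does not specify how the hybrid annotation setup is built, so the following is only a minimal sketch of one common pattern, assuming a model (for example an LLM) returns a label with a confidence score for each item. Items at or above a confidence threshold are accepted automatically and the rest are routed to human annotators. The names `Prediction` and `route_for_review`, the example data, and the 0.8 threshold are all hypothetical and are not taken from the tutorial.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """A model-proposed label for one text item (hypothetical structure)."""
    text: str
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route_for_review(predictions, threshold=0.8):
    """Split predictions into auto-accepted ones and ones needing human review.

    The threshold is an illustrative assumption; in practice it would be
    tuned against a human-labeled validation sample.
    """
    auto_accepted, needs_review = [], []
    for pred in predictions:
        if pred.confidence >= threshold:
            auto_accepted.append(pred)
        else:
            needs_review.append(pred)
    return auto_accepted, needs_review


# Made-up example data, for illustration only.
predictions = [
    Prediction("Great product, works as described.", "positive", 0.97),
    Prediction("It arrived on Tuesday.", "neutral", 0.55),
    Prediction("Terrible support, never again.", "negative", 0.91),
]

auto_accepted, needs_review = route_for_review(predictions)
print(len(auto_accepted), "auto-accepted;", len(needs_review), "sent to annotators")
```

Raising the threshold sends more items to human annotators and lowers the risk of accepting wrong model labels; lowering it does the opposite. Auditing a random sample of the auto-accepted items is one way to check the quality of the final dataset.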
