DS AI DM COOct 27, 2021

Active clustering for labeling training data

Quentin Lutz, Élie de Panafieu, Alex Scott, Maya Stein

arXiv:2110.14521v16 citations

Originality Incremental advance

AI Analysis

This work addresses the high cost of data labeling in supervised learning, which is a critical bottleneck for practitioners, though it appears incremental as it builds on existing active learning and clustering concepts.

The paper tackles the problem of reducing the cost of labeling training data by proposing a setting where human experts answer pairwise queries instead of providing direct labels, and the computer groups items into classes. They analyze two random models for class generation, characterize algorithms that minimize the average number of queries, and propose solutions for handling errors in expert answers.

Gathering training data is a key step of any supervised learning task, and it is both critical and expensive. Critical, because the quantity and quality of the training data has a high impact on the performance of the learned function. Expensive, because most practical cases rely on humans-in-the-loop to label the data. The process of determining the correct labels is much more expensive than comparing two items to see whether they belong to the same class. Thus motivated, we propose a setting for training data gathering where the human experts perform the comparatively cheap task of answering pairwise queries, and the computer groups the items into classes (which can be labeled cheaply at the very end of the process). Given the items, we consider two random models for the classes: one where the set partition they form is drawn uniformly, the other one where each item chooses its class independently following a fixed distribution. In the first model, we characterize the algorithms that minimize the average number of queries required to cluster the items and analyze their complexity. In the second model, we analyze a specific algorithm family, propose as a conjecture that they reach the minimum average number of queries and compare their performance to a random approach. We also propose solutions to handle errors or inconsistencies in the experts' answers.

View on arXiv PDF

Similar