CL · Oct 15, 2021

Clean or Annotate: How to Spend a Limited Data Collection Budget

arXiv:2110.08355v230.3630 citations

Originality: Incremental advance

AI Analysis

This work addresses how machine learning practitioners can make efficient use of a limited data collection budget, offering an incremental improvement over prior strategies for handling label noise.

The paper addresses noisy labels in crowdsourced datasets collected under a limited annotation budget. It proposes a hybrid approach that spends most of the budget on initial labeling and uses the remainder to explicitly relabel the examples most likely to be mislabeled. Across four NLP tasks, this matches or outperforms existing methods such as label aggregation and denoising when given the same budget.

Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of such noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. The second, considered in prior work, uses the entire annotation budget to label as many examples as possible and subsequently applies denoising algorithms to implicitly clean the dataset. We find a middle ground and propose an approach which reserves a fraction of annotations to explicitly clean up the samples most likely to be erroneous, in order to optimize the annotation process. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify specific examples that appear most likely to be incorrect, and we spend the remaining budget relabeling them. Experiments across three model variations and four natural language processing tasks show our approach outperforms or matches both label aggregation and advanced denoising methods designed to handle noisy labels when allocated the same finite annotation budget.
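The budget split described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split_budget`, `rank_suspects`, `clean_with_budget`), the `clean_fraction` parameter, and the rule of flagging examples whose current label gets low model probability are all hypothetical choices made for this example. It also assumes each example receives exactly one label in the initial pass, and that `relabel` stands in for a fresh annotation request.

```python
from typing import Callable, List, Sequence, Tuple


def split_budget(total_annotations: int, clean_fraction: float) -> Tuple[int, int]:
    """Split a fixed annotation budget into (initial labels, relabels).

    clean_fraction is a hypothetical knob: the share of the budget held
    back for explicit cleaning.
    """
    if not 0.0 <= clean_fraction < 1.0:
        raise ValueError("clean_fraction must be in [0, 1)")
    n_relabel = int(total_annotations * clean_fraction)
    return total_annotations - n_relabel, n_relabel


def rank_suspects(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> List[int]:
    """Return example indices ordered from most to least suspicious.

    Assumed heuristic: an example is suspicious when the trained model
    assigns low probability to the label the example currently carries.
    """
    return sorted(range(len(labels)), key=lambda i: probs[i][labels[i]])


def clean_with_budget(
    labels: Sequence[int],
    probs: Sequence[Sequence[float]],
    n_relabel: int,
    relabel: Callable[[int], int],
) -> List[int]:
    """Relabel the n_relabel most suspicious examples; keep the rest as-is."""
    cleaned = list(labels)
    for i in rank_suspects(probs, labels)[:n_relabel]:
        cleaned[i] = relabel(i)  # stand-in for a new annotation of example i
    return cleaned


# Toy usage: four examples, two classes, a budget that allows two relabels.
noisy_labels = [0, 1, 1, 0]
model_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
reference = [0, 0, 1, 1]  # pretend these are what a fresh annotation returns
cleaned = clean_with_budget(noisy_labels, model_probs, 2, lambda i: reference[i])
```

In this toy run the model disagrees most strongly with the labels of examples 1 and 3, so those two are the ones sent for relabeling. In practice `model_probs` would come from a classifier trained on the initial noisy dataset, and the fraction held back for cleaning would need to be tuned.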
