ML AI LG

Jun 6, 2022

Training Subset Selection for Weak Supervision

Hunter Lang, Aravindan Vijayaraghavan, David Sontag

MIT

arXiv:2206.02914v2

18.627 citations

h-index: 52

Has Code

Originality: Incremental advance

AI Analysis

This work addresses a key efficiency and performance issue in weak supervision pipelines and offers machine learning practitioners a simple plug-in solution.

The paper tackles suboptimal performance in weak supervision by showing that training on all weakly-labeled data is not always best. It introduces a subset selection method that improves accuracy by up to 19% (absolute) on benchmark tasks.

Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
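The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration of a cut-statistic filter, not the authors' released code: the function name `cut_statistic_select`, the parameters `k` and `keep_fraction`, the directed k-nearest-neighbor graph, the edge weight 1 / (1 + distance), and the use of empirical class frequencies as priors are all assumptions made for the example. Here `features` stands in for the pretrained representations and `weak_labels` for the label model's hard labels on the covered examples.

```python
import numpy as np


def cut_statistic_select(features, weak_labels, k=5, keep_fraction=0.5):
    """Keep the examples whose weak labels agree most with their neighbors.

    Hypothetical helper: returns (sorted indices to keep, z-score per example).
    Lower z-scores mean fewer label disagreements than expected by chance.
    Requires k < number of examples.
    """
    features = np.asarray(features, dtype=float)
    weak_labels = np.asarray(weak_labels)
    n = len(weak_labels)

    # Pairwise Euclidean distances in representation space.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbor

    # Directed k-nearest-neighbor graph (an assumption of this sketch).
    neighbors = np.argsort(dists, axis=1)[:, :k]
    neighbor_dists = np.take_along_axis(dists, neighbors, axis=1)
    weights = 1.0 / (1.0 + neighbor_dists)  # assumed edge weighting

    # Empirical class priors estimated from the weak labels themselves.
    classes, counts = np.unique(weak_labels, return_counts=True)
    prior = dict(zip(classes.tolist(), (counts / n).tolist()))
    p_own = np.array([prior[y] for y in weak_labels.tolist()])

    # Cut statistic: total weight of edges to differently-labeled neighbors.
    disagree = weak_labels[neighbors] != weak_labels[:, None]
    cut = (weights * disagree).sum(axis=1)

    # Mean and variance of the cut if neighbor labels were drawn
    # independently from the class priors (the null hypothesis).
    mean = (1.0 - p_own) * weights.sum(axis=1)
    var = p_own * (1.0 - p_own) * (weights ** 2).sum(axis=1)
    z = (cut - mean) / np.sqrt(np.maximum(var, 1e-12))

    # Keep the fraction of examples with the lowest z-scores.
    n_keep = max(1, int(round(keep_fraction * n)))
    keep = np.sort(np.argsort(z, kind="stable")[:n_keep])
    return keep, z
```

In a pipeline, the returned indices would be used to subset the weakly-labeled training set before fitting the end classifier. The kept fraction governs the tradeoff the abstract describes: a smaller fraction yields fewer but (hopefully) more precise weak labels.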
