LG AP ML · Oct 7, 2021

Creating Training Sets via Weak Indirect Supervision

Jieyu Zhang, Bohan Wang, Xiangchen Song, Yujing Wang, Yaming Yang, Jing Bai, Alexander Ratner

arXiv:2110.03484v313.617 citations

Originality: Incremental advance

AI Analysis

This paper addresses the data-labeling bottleneck for machine learning practitioners by extending weak supervision to indirect sources. The advance is incremental because it builds on existing weak supervision frameworks.

The paper tackles the problem of creating labeled training sets by formulating Weak Indirect Supervision (WIS), in which training labels are synthesized from noisy sources whose output label spaces differ from that of the target task. Its proposed method, PLRM, outperforms baselines by 2%-9% on image and text classification tasks and on an industrial advertising application.

Creating labeled training sets has become one of the major roadblocks in machine learning. To address this, recent Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources. However, existing frameworks are restricted to supervision sources that share the same output space as the target task. To extend the scope of usable sources, we formulate Weak Indirect Supervision (WIS), a new research problem for automatically synthesizing training labels based on indirect supervision sources that have different output label spaces. To overcome the challenge of mismatched output spaces, we develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources. Moreover, we provide a theoretically-principled test of the distinguishability of PLRM for unseen labels, along with a generalization bound. On both image and text classification tasks as well as an industrial advertising application, we demonstrate the advantages of PLRM by outperforming baselines by a margin of 2%-9%.
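To make the mismatched-output-space setting concrete, the sketch below shows a naive way to turn indirect votes into a soft label over the target classes. It is not PLRM: it replaces the paper's probabilistic model with simple uniform vote spreading. The label names, the `COMPATIBLE` relation table, and the `soft_label` helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical target label space (not taken from the paper).
TARGET_LABELS = ["husky", "bulldog", "tabby"]

# Hypothetical user-provided label relations: each indirect label maps to the
# set of target labels it is compatible with (e.g. "dog" covers "husky").
COMPATIBLE = {
    "dog": {"husky", "bulldog"},
    "cat": {"tabby"},
    "short_haired": {"bulldog", "tabby"},
}


def soft_label(source_votes):
    """Turn indirect votes into a distribution over TARGET_LABELS.

    Each vote spreads one unit of mass uniformly over the target labels it is
    compatible with. Abstentions (None) are skipped. This is a naive
    stand-in for a learned label model, not the paper's method.
    """
    scores = defaultdict(float)
    for vote in source_votes:
        if vote is None:
            continue
        compatible = COMPATIBLE[vote]
        for target in compatible:
            scores[target] += 1.0 / len(compatible)

    total = sum(scores.values())
    if total == 0:
        # No source voted: fall back to a uniform distribution.
        return {t: 1.0 / len(TARGET_LABELS) for t in TARGET_LABELS}
    return {t: scores[t] / total for t in TARGET_LABELS}


# Three indirect sources: one says "dog", one says "short_haired", one abstains.
probs = soft_label(["dog", "short_haired", None])
```

Here neither source ever outputs a target label, yet their votes jointly favor "bulldog", the only class compatible with both. A learned model such as the one the abstract describes would additionally account for source noise, which this sketch ignores.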
