LG AI ML Jun 18, 2021

Dependency Structure Misspecification in Multi-Source Weak Supervision Models

Salva Rühling Cachay, Benedikt Boecking, Artur Dubrawski

arXiv:2106.10302v16.59 citations

Originality: Incremental advance

AI Analysis

The paper addresses an awareness gap for weak supervision practitioners: ignoring dependency structures among labeling functions can degrade downstream classifier performance. The contribution is incremental, since it analyzes one specific type of misspecification.

The paper tackles label model misspecification in data programming. It analyzes how over-specifying the dependency structure among labeling functions leads to modeling errors, derives theoretical bounds on those errors, and shows empirically that they can be substantial.

Data programming (DP) has proven to be an attractive alternative to costly hand-labeling of data. In DP, users encode domain knowledge into labeling functions (LFs), heuristics that label a subset of the data noisily and may have complex dependencies. A label model is then fit to the LFs to produce an estimate of the unknown class label. The effects of label model misspecification on test set performance of a downstream classifier are understudied. This presents a serious awareness gap to practitioners, in particular since the dependency structure among LFs is frequently ignored in field applications of DP. We analyse modeling errors due to structure over-specification. We derive novel theoretical bounds on the modeling error and empirically show that this error can be substantial, even when modeling a seemingly sensible structure.
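To make the data programming setup in the abstract concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's method: the toy spam task, the three labeling functions, the `ABSTAIN = -1` convention, and the use of a plain majority vote as a stand-in label model. Majority vote treats the LFs as independent and equally accurate, so it corresponds to the case where dependency structure is ignored altogether. It does not reproduce the structure over-specification the paper analyses.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: an LF returns -1 when it does not vote


# Hypothetical labeling functions for a toy spam task (1 = spam, 0 = not spam).
def lf_contains_free(text):
    return 1 if "free" in text.lower() else ABSTAIN


def lf_contains_winner(text):
    return 1 if "winner" in text.lower() else ABSTAIN


def lf_short_greeting(text):
    return 0 if text.lower().startswith("hi") and len(text) < 40 else ABSTAIN


LFS = [lf_contains_free, lf_contains_winner, lf_short_greeting]


def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]


def majority_vote(row):
    """Stand-in label model that ignores any dependencies among LFs."""
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN  # no LF voted on this example
    (top, n_top), *rest = votes.most_common()
    if rest and rest[0][1] == n_top:
        return ABSTAIN  # tie between classes: no estimate
    return top


texts = [
    "Hi, lunch tomorrow?",
    "You are a WINNER, claim your FREE prize",
    "Meeting moved to 3pm",
]
label_matrix = apply_lfs(texts)
estimates = [majority_vote(row) for row in label_matrix]
```

In this sketch the two keyword LFs tend to fire on the same messages, so they are plausibly correlated. A label model with an explicit dependency structure would have to decide whether to model that correlation, which is the kind of modeling choice whose misspecification the abstract is concerned with.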
