LG · Feb 17, 2025

ScriptoriumWS: A Code Generation Assistant for Weak Supervision

Tzu-Heng Huang, Catherine Cao, Spencer Schoenberg, Harit Vishwakarma, Nicholas Roberts, Frederic Sala

arXiv:2502.12366v116.910 citations · h-index: 9

Originality: Incremental advance

AI Analysis

The work addresses the labeled data bottleneck for machine learning practitioners by reducing the cost of creating weak supervision sources. The advance is incremental, since it builds on existing weak supervision frameworks.

The paper tackles the cost of hand-crafting weak supervision sources by using code-generation models as coding assistants. The resulting system, ScriptoriumWS, maintains accuracy and greatly improves coverage compared to hand-crafted sources.

Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
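To make the setup in the abstract concrete, here is a minimal Python sketch of weak supervision sources written as small programs, with their votes aggregated into pseudolabels. Everything in it is an illustrative assumption rather than code from the paper: the spam/ham task, the three labeling functions, and the helper names are hypothetical, and a plain majority vote stands in for the learned aggregation that weak supervision systems typically use. The `coverage` helper shows one common reading of the metric the abstract refers to, namely the fraction of examples that receive at least one non-abstaining vote.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical label space for an illustrative spam-detection task.
ABSTAIN = -1
HAM = 0
SPAM = 1

LabelingFunction = Callable[[str], int]


def lf_contains_link(text: str) -> int:
    """Vote SPAM if the text contains a URL, otherwise abstain."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Vote SPAM if the text mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_greeting(text: str) -> int:
    """Vote HAM for short messages that open with a greeting."""
    words = text.lower().split()
    if words and len(words) <= 5 and words[0] in {"hi", "hello", "thanks"}:
        return HAM
    return ABSTAIN


LFS: List[LabelingFunction] = [lf_contains_link, lf_mentions_prize, lf_short_greeting]


def apply_lfs(lfs: Sequence[LabelingFunction], texts: Sequence[str]) -> List[List[int]]:
    """Build the label matrix: one row per example, one column per source."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: Sequence[int]) -> int:
    """Aggregate one row of votes; abstain when there are no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def coverage(label_matrix: Sequence[Sequence[int]]) -> float:
    """Fraction of examples on which at least one source does not abstain."""
    if not label_matrix:
        return 0.0
    covered = sum(1 for row in label_matrix if any(v != ABSTAIN for v in row))
    return covered / len(label_matrix)


if __name__ == "__main__":
    texts = [
        "Win a prize now at http://example.com",
        "hello there",
        "Meeting moved to 3pm",
        "Claim your PRIZE today",
    ]
    matrix = apply_lfs(LFS, texts)
    pseudolabels = [majority_vote(row) for row in matrix]
    print("pseudolabels:", pseudolabels)
    print("coverage:", coverage(matrix))
```

In this framing, a code-generation assistant would propose additional functions with the same signature as `lf_contains_link`. Adding more such sources is one way coverage can rise, since each new function may vote on examples that the existing ones leave unlabeled.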
