Radically Lower Data-Labeling Costs for Visually Rich Document Extraction Models
This work addresses the data-labeling bottleneck for businesses and researchers building document extraction models, though it is an incremental improvement on existing active learning methods.
The paper tackles the high cost of labeling visually rich documents for extraction models by introducing Selective Labeling, which simplifies labeling to yes/no decisions on model predictions and uses active learning to target uncertain cases. Experiments across three domains show this approach reduces labeling costs by 10x with minimal accuracy loss.
A key bottleneck in building automatic extraction models for visually rich documents like invoices is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. We propose Selective Labeling, which simplifies the labeling task to providing "yes/no" labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that Selective Labeling can reduce the cost of acquiring labeled data by $10\times$ with a negligible loss in accuracy.
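The selection step described in the abstract can be illustrated with a minimal sketch. The paper's custom active learning strategy is not specified here, so this example substitutes plain uncertainty sampling: it ranks candidate extractions by how close the model's score is to 0.5 and sends the top few to a human for a yes/no decision. The `Candidate` type, the `select_for_review` function, the `budget` parameter, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate extraction proposed by the model (hypothetical structure)."""
    doc_id: str
    field: str
    text: str
    score: float  # model's estimated probability that the candidate is correct


def uncertainty(score: float) -> float:
    """Map a probability to an uncertainty: 1.0 at 0.5, 0.0 at 0 or 1."""
    return 1.0 - abs(score - 0.5) * 2.0


def select_for_review(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Return the `budget` candidates the model is least sure about.

    This is generic uncertainty sampling, used as a stand-in for the
    paper's custom strategy, which is not described in this summary.
    """
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


# Made-up candidates for illustration only.
candidates = [
    Candidate("inv-001", "total_amount", "$1,240.00", 0.97),
    Candidate("inv-001", "due_date", "2021-03-04", 0.52),
    Candidate("inv-002", "total_amount", "$88.10", 0.45),
    Candidate("inv-002", "invoice_id", "A-7731", 0.08),
]

selected = select_for_review(candidates, budget=2)
for c in selected:
    # A human annotator would answer "yes" or "no" for each of these.
    print(f"{c.doc_id} {c.field} = {c.text!r} (score {c.score:.2f})")
```

Under these assumptions, confident predictions (scores near 0 or 1) are never shown to the annotator, which is where the labeling savings would come from. The yes/no answers on the selected candidates would then be fed back as training labels.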