CL · AI · LG · Mar 30, 2024

Noise-Aware Training of Layout-Aware Language Models

arXiv:2404.00488v1 · h-index: 13
Originality: Highly original
AI Analysis

The paper addresses the scalability bottleneck for enterprises that need to train extractors for thousands of document types. The contribution is incremental in that it builds on existing extractor models and adds a novel training method.

The paper tackles the problem of training custom named entity extractors for visually rich documents without expensive human-labeled data. It proposes Noise-Aware Training (NAT), which trains on weakly labeled documents and uses per-sample confidence estimates to handle label noise. Compared to baselines, NAT yields up to 6% higher macro-F1 scores and reduces human effort by up to 73%.
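For readers unfamiliar with the metric quoted above, macro-F1 is the unweighted mean of per-class F1 scores, so rare entity types count as much as frequent ones. The sketch below is a simplified, label-per-item version; the paper's own evaluation protocol (for example, entity-level matching) is not described on this page, and the entity names used here are hypothetical.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all classes seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical entity labels, for illustration only.
gold = ["date", "date", "total", "vendor"]
predicted = ["date", "total", "total", "vendor"]
score = macro_f1(gold, predicted)
```

Here "date" and "total" each score 2/3 and "vendor" scores 1, so the macro average is 7/9.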

A visually rich document (VRD) utilizes visual features along with linguistic cues to disseminate information. Training a custom extractor that identifies named entities from a document requires a large number of instances of the target document type, annotated in both textual and visual modalities. This is an expensive bottleneck in enterprise scenarios, where we want to train custom extractors for thousands of different document types in a scalable way. Pre-training an extractor model on unlabeled instances of the target document type, followed by a fine-tuning step on human-labeled instances, does not work in these scenarios, as it exceeds the maximum allowable training time allocated for the extractor. We address this scenario by proposing a Noise-Aware Training method, or NAT, in this paper. Instead of acquiring expensive human-labeled documents, NAT utilizes weakly labeled documents to train an extractor in a scalable way. To avoid degradation in the model's quality due to noisy, weakly labeled samples, NAT estimates the confidence of each training sample and incorporates it as an uncertainty measure during training. We train multiple state-of-the-art extractor models using NAT. Experiments on a number of publicly available and in-house datasets show that NAT-trained models are robust in performance, outperforming a transfer-learning baseline by up to 6% in terms of macro-F1 score, and also more label-efficient, reducing the amount of human effort required to obtain comparable performance by up to 73%.
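The abstract does not spell out how the confidence estimate enters training. One common way to use per-sample confidence with noisy labels is to scale each sample's cross-entropy term by its confidence, so that doubtful weak labels contribute less to the loss. The sketch below illustrates that general idea only; it is an assumption, not the paper's actual objective, and the function and parameter names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def confidence_weighted_loss(batch_logits, weak_labels, confidences):
    """Cross-entropy averaged with per-sample confidence weights in [0, 1].

    Assumption: confidence acts as a simple multiplicative weight; the paper
    may incorporate uncertainty differently.
    """
    if not (len(batch_logits) == len(weak_labels) == len(confidences)):
        raise ValueError("all inputs must have the same length")
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label, conf in zip(batch_logits, weak_labels, confidences):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        nll = -log_softmax(logits)[label]  # negative log-likelihood of the weak label
        weighted_sum += conf * nll
        weight_total += conf
    # Normalize by total confidence so the scale does not depend on how many
    # samples were down-weighted; return 0 if every sample has zero confidence.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two samples with identical predictions; the second weak label disagrees
# with the model and is assigned low confidence.
logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]
uniform = confidence_weighted_loss(logits, labels, [1.0, 1.0])
down_weighted = confidence_weighted_loss(logits, labels, [1.0, 0.0])
```

With uniform weights the suspect label dominates the loss; with its confidence set to zero, only the trusted sample contributes.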

