AI · Jun 17, 2025

QUEST: Quality-aware Semi-supervised Table Extraction for Business Documents

arXiv:2506.14568v2 · 1 citation · h-index: 30 · ICDAR
Originality: Incremental advance
AI Analysis

This paper addresses sparse annotations and error-prone pipelines in industrial workflows for business document processing, offering an incremental improvement over existing semi-supervised methods.

The paper automates table extraction from business documents with QUEST, a quality-aware semi-supervised framework. QUEST improves F1 from 64% to 74% on a proprietary dataset and from 42% to 50% on the DocILE benchmark, while reducing empty predictions by 45% and 19% in relative terms, respectively.

Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables, trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1000 annotated + 10000 unannotated documents) show QUEST improves F1 from 64% to 74% and reduces empty predictions by 45% (from 12% to 6.5%). On the DocILE benchmark (600 annotated + 20000 unannotated documents), QUEST achieves a 50% F1 score (up from 42%) and reduces empty predictions by 19% (from 27% to 22%). The framework's interpretable quality assessments and robustness to annotation scarcity make it particularly suited for business documents, where structural consistency and data completeness are paramount.
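The selection step described in the abstract (keep pseudo-labels whose predicted F1 is high, then favor diverse ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Candidate` fields, the `min_f1` threshold, the `budget`, and the greedy cosine-based rule are all assumptions made for illustration, and `int_div` uses one simple formulation (one minus mean pairwise cosine similarity) that may differ from the paper's exact definition. DPP and Vendi score are not shown.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Candidate:
    doc_id: str
    predicted_f1: float    # assumed output of a quality model, in [0, 1]
    features: List[float]  # assumed embedding of the extracted table


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def int_div(items: List[Candidate]) -> float:
    """Illustrative internal diversity: 1 - mean pairwise cosine similarity."""
    n = len(items)
    if n < 2:
        return 0.0
    total = sum(
        cosine(items[i].features, items[j].features)
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 1.0 - total / (n * (n - 1) / 2)


def select_pseudo_labels(
    candidates: List[Candidate], min_f1: float = 0.9, budget: int = 2
) -> List[Candidate]:
    """Quality gate first, then greedy diversity-aware selection."""
    # Step 1: keep only extractions whose predicted F1 clears the threshold.
    pool = sorted(
        (c for c in candidates if c.predicted_f1 >= min_f1),
        key=lambda c: c.predicted_f1,
        reverse=True,
    )
    selected: List[Candidate] = []
    # Step 2: seed with the highest-quality item, then repeatedly add the
    # candidate least similar to anything already selected.
    while pool and len(selected) < budget:
        if not selected:
            best = pool[0]
        else:
            best = min(
                pool,
                key=lambda c: max(cosine(c.features, s.features) for s in selected),
            )
        selected.append(best)
        pool.remove(best)
    return selected
```

Gating on predicted quality before diversifying keeps low-quality extractions out of the pseudo-label set regardless of how novel they look, while the diversity step discourages picking near-duplicate documents, which is one way to limit confirmation bias.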

Code Implementations: 1 repo
