CL · Nov 9, 2022

DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop

arXiv:2211.04934v1290 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the costly and time-consuming manual annotation required to build document-specific models for business applications. The advance is incremental, since it builds on existing human-in-the-loop and bootstrapping techniques.

The paper tackles information extraction from diverse business documents by introducing DoSA, a system that accelerates annotation with a human-in-the-loop bootstrap approach. An open-source implementation is available on GitHub.

Business documents come in a variety of structures, formats, and information needs, which makes information extraction a challenging task. Because of these variations, a document-generic model that works well across all document types and all use cases seems far-fetched. Document-specific models, in turn, need customized document-specific labels. We introduce DoSA (Document Specific Automated Annotations), which helps annotators generate initial annotations automatically using our novel bootstrap approach, which leverages document-generic datasets and models. These initial annotations can then be reviewed by a human for correctness. An initial document-specific model can be trained, and its inference can be used as feedback for generating more automated annotations. These automated annotations can again be reviewed by a human-in-the-loop for correctness, and a new, improved model can be trained using the current model as the pre-trained model before moving to the next iteration. In this paper, our scope is limited to form-like documents because of the limited availability of generic annotated datasets, but the idea can be extended to a variety of other documents as more datasets are built. An open-source, ready-to-use implementation is available on GitHub: https://github.com/neeleshkshukla/DoSA.
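The iteration described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the DoSA repository's code: the function `bootstrap_loop` and its callback parameters `generic_annotate`, `human_review`, and `train` are hypothetical names chosen here, and the real system's interfaces may differ.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Doc = str                        # stand-in for a document (assumption)
Labels = dict                    # field name -> value (assumption)
Model = Callable[[Doc], Labels]  # a trained model maps a document to labels


def bootstrap_loop(
    batches: Sequence[Sequence[Doc]],
    generic_annotate: Callable[[Doc], Labels],
    human_review: Callable[[Doc, Labels], Labels],
    train: Callable[[List[Tuple[Doc, Labels]], Optional[Model]], Model],
) -> Tuple[Optional[Model], List[Tuple[Doc, Labels]]]:
    """Hypothetical human-in-the-loop bootstrap; one batch per iteration."""
    model: Optional[Model] = None
    reviewed: List[Tuple[Doc, Labels]] = []
    for batch in batches:
        for doc in batch:
            # First iteration: propose labels with the document-generic
            # annotator. Later iterations: use the current model's inference.
            proposed = generic_annotate(doc) if model is None else model(doc)
            # A human corrects the proposal before it joins the training set.
            reviewed.append((doc, human_review(doc, proposed)))
        # Train the next model, passing the current one as the starting point.
        model = train(reviewed, model)
    return model, reviewed
```

Passing the previous model into `train` mirrors the abstract's point that each new model uses the current one as its pre-trained starting point. How many documents go into each batch is left to the caller in this sketch.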

Code Implementations: 1 repo