CL | AI | LG | Feb 11, 2023

DocILE Benchmark for Document Information Localization and Extraction

arXiv:2302.05658v2 | 58 citations | h-index: 40 | Has Code
AI Analysis

This paper addresses the problem of extracting structured information from diverse business documents, and is aimed at researchers and practitioners in document AI. Its contribution is incremental, since it builds on existing datasets and methods.

The paper introduces the DocILE benchmark, a large dataset of business documents for two tasks: Key Information Localization and Extraction, and Line Item Recognition. The dataset contains 6.7k annotated documents, 100k synthetic documents, and nearly 1M unlabeled documents. Baseline models are included, and their results serve as a starting point for future research.

This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer, applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset, baselines and supplementary material are available at https://github.com/rossumai/docile.
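
To make the two tasks concrete, the sketch below shows one way the outputs could be represented: each extracted field carries a class label, its text and a bounding box, and Line Item Recognition additionally assigns a field to a table item. This is only an illustration under stated assumptions. The `Field` class, the `group_line_items` helper, the label strings and the sample values are all hypothetical, and none of it is the API of the `docile` repository or the dataset's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Field:
    """One localized field. Hypothetical structure, not the benchmark's schema."""
    fieldtype: str                            # class label (illustrative names only)
    text: str                                 # extracted text
    bbox: Tuple[float, float, float, float]   # (left, top, right, bottom), relative to the page
    page: int = 0
    line_item_id: Optional[int] = None        # None for document-level fields


def group_line_items(fields: List[Field]) -> Dict[int, Dict[str, str]]:
    """Group fields that belong to a line item, mapping item id -> {fieldtype: text}."""
    items: Dict[int, Dict[str, str]] = defaultdict(dict)
    for f in fields:
        if f.line_item_id is not None:
            items[f.line_item_id][f.fieldtype] = f.text
    return dict(sorted(items.items()))


# Made-up sample: one document-level field and two line items.
fields = [
    Field("vendor_name", "Acme Corp", (0.10, 0.05, 0.35, 0.08)),
    Field("line_item_description", "Widget", (0.10, 0.40, 0.40, 0.43), line_item_id=1),
    Field("line_item_quantity", "2", (0.55, 0.40, 0.60, 0.43), line_item_id=1),
    Field("line_item_description", "Gadget", (0.10, 0.45, 0.40, 0.48), line_item_id=2),
]
line_items = group_line_items(fields)
```

In this sketch, document-level fields (those with no `line_item_id`) correspond to the key information task, while the grouped dictionary corresponds to assigning key information to items in a table.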

Code Implementations: 1 repo
