Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs
This work addresses the need for automated document processing in the public sector, but its contribution is incremental: it applies existing annotation methods to a new domain-specific dataset.
The researchers tackled the challenge of automatically processing digital documents in public affairs by creating a novel database for Document Layout Analysis. It comprises 37.9K documents with more than 441K pages and more than 8M labels, and their text labeling procedure was validated with up to 99% accuracy.
Every day, thousands of digital documents are generated with useful information for companies, public organizations, and citizens. Given the impossibility of processing them manually, the automatic processing of these documents is becoming increasingly necessary in certain sectors. However, this task remains challenging, since in most cases text-only parsing is not enough to fully understand the information presented through different components of varying significance. In this regard, Document Layout Analysis (DLA), which aims to detect and classify the basic components of a document, has been an interesting research field for many years. In this work, we propose a procedure to semi-automatically annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public affairs domain, using a set of 24 data sources from the Spanish Administration. The database comprises 37.9K documents with more than 441K document pages, and more than 8M labels associated with 8 layout block units. The results of our experiments validate the proposed text labeling procedure with an accuracy of up to 99%.
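To give a sense of the scale implied by these figures, the short Python sketch below derives rough per-document and per-page averages. It is only back-of-the-envelope arithmetic on the rounded numbers quoted in the abstract. The page and label counts are stated as "more than", so the results are approximate, and the function and variable names are illustrative assumptions rather than part of the database.

```python
# Rounded figures as quoted in the abstract. The page and label counts are
# given as "more than", so the ratios below are rough estimates only.
NUM_DOCUMENTS = 37_900
NUM_PAGES = 441_000
NUM_LABELS = 8_000_000


def dataset_ratios(documents: int, pages: int, labels: int) -> dict:
    """Return approximate per-document and per-page averages (hypothetical helper)."""
    return {
        "pages_per_document": pages / documents,
        "labels_per_page": labels / pages,
        "labels_per_document": labels / documents,
    }


ratios = dataset_ratios(NUM_DOCUMENTS, NUM_PAGES, NUM_LABELS)
for name, value in ratios.items():
    print(f"{name}: {value:.1f}")
```

Under these rounded inputs, a document averages roughly a dozen pages and a page carries on the order of 18 labels, which suggests the annotation is fairly fine-grained.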