DocParser: Hierarchical Structure Parsing of Document Renderings
The work addresses a practical need in automated document analysis for real-world applications. Its contribution is incremental, with a focus on improving performance in data-scarce settings.
The paper tackles the problem of parsing hierarchical document structures from renderings such as PDFs. It develops DocParser, an end-to-end system that covers text elements, figures, and tables, and shows that a novel weak supervision approach improves the mean average precision of entity detection by 39.1% and the F1 score of relation classification by 35.8%.
Translating renderings (e.g., PDFs, scans) into hierarchical document structures is in high demand in many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed "DocParser": an end-to-end system for parsing the complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data are scarce, which we address with a novel approach to weak supervision that significantly improves document structure parsing performance. Our experiments confirm the effectiveness of the proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and the F1 score for classifying hierarchical relations by 35.8%.
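To make the notion of a hierarchical document structure more concrete, the following is a minimal sketch of how detected entities could be arranged in a tree and how an F1 score over parent-child relations might be computed. It is not the DocParser implementation: the `Entity` class, the `parent_of` relation label, the bounding-box layout, and the exact-match scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One node of a hypothetical document tree (all names are illustrative)."""
    id: int
    category: str                      # e.g. "table", "table_cell", "figure"
    bbox: tuple                        # assumed layout: (x0, y0, x1, y1)
    children: list = field(default_factory=list)


def relation_triples(root):
    """Flatten a tree into a set of (parent_id, "parent_of", child_id) triples."""
    triples = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            triples.add((node.id, "parent_of", child.id))
            stack.append(child)
    return triples


def f1_score(predicted, gold):
    """F1 over two sets of relation triples, assuming exact-match scoring."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def build_example():
    """A toy page: a document holding a table with two cells, plus a figure."""
    cell_a = Entity(3, "table_cell", (10, 10, 50, 20))
    cell_b = Entity(4, "table_cell", (50, 10, 90, 20))
    table = Entity(2, "table", (10, 10, 90, 40), [cell_a, cell_b])
    figure = Entity(5, "figure", (10, 50, 90, 90))
    return Entity(1, "document", (0, 0, 100, 100), [table, figure])
```

In this toy example, a prediction that attaches one table cell to the document root instead of to its table gets three of four relations right, so precision and recall are both 0.75 and the F1 score is 0.75. Representing relations as a set of triples keeps the comparison independent of traversal order.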