MM AI CL CV Oct 28, 2024

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang

arXiv:2410.21169v4 17.852 citations h-index: 30

Originality: Synthesis-oriented

AI Analysis

It addresses the problem of extracting structured information from documents for applications such as knowledge base construction, but it is incremental as it synthesizes existing research.

This survey reviews document parsing techniques for converting unstructured documents into structured data, covering methodologies from modular pipelines to end-to-end models and discussing challenges like complex layouts and high-density text.

Document parsing is essential for converting unstructured and semi-structured documents such as contracts, academic papers, and invoices into structured, machine-readable data. Document parsing extracts reliable structured data from unstructured inputs, which benefits numerous applications. With recent advances in Large Language Models, document parsing plays an indispensable role in both knowledge base construction and training data generation. This survey presents a comprehensive review of the current state of document parsing, covering key methodologies that range from modular pipeline systems to end-to-end models driven by large vision-language models. Core components such as layout detection, content extraction (including text, tables, and mathematical expressions), and multi-modal data integration are examined in detail. Additionally, this paper discusses the challenges that modular document parsing systems and vision-language models face in handling complex layouts, integrating multiple modules, and recognizing high-density text. It outlines future research directions and emphasizes the importance of developing larger and more diverse datasets.
