CL | AI | LG | Sep 19, 2023

LMDX: Language Model-based Document Information Extraction and Localization

arXiv:2309.10952v2 | 61 citations | h-index: 37
Originality: Highly original
AI Analysis

LMDX addresses a core challenge in document processing workflows for industries that handle semi-structured documents, offering a high-quality, data-efficient approach to extraction.

The paper tackles information extraction from visually rich documents with large language models (LLMs), which had previously failed at this task because they lack layout encoding and a grounding mechanism. It introduces LMDX to reframe the task for an LLM and sets a new state of the art on the VRDU and CORD benchmarks.

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art and exhibiting emergent capabilities across various tasks. However, their application to extracting information from visually rich documents, which is at the core of many document processing workflows and involves the extraction of key entities from semi-structured documents, has not yet been successful. The main obstacles to adopting LLMs for this task are the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism to localize the predicted entities within the document. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to reframe the document information extraction task for an LLM. LMDX enables extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. Finally, we apply LMDX to the PaLM 2-S and Gemini Pro LLMs and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.
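
The two obstacles named in the abstract, layout encoding and grounding, can be illustrated with a small sketch. This is not taken from the paper: the `Segment` type, the `text XX|YY` prompt format, the bucket count, and the `ground` check are all assumptions chosen for illustration. The idea is that each OCR segment is serialized with quantized coordinates so a text-only LLM can see layout, and a prediction is accepted only if it points back to a segment that actually contains the predicted text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """A hypothetical OCR segment with a normalized center position in [0, 1]."""
    text: str
    x_center: float
    y_center: float


def quantize(value: float, buckets: int = 100) -> int:
    """Map a normalized coordinate to an integer bucket in [0, buckets - 1]."""
    return min(buckets - 1, max(0, int(value * buckets)))


def encode_segments(segments: List[Segment], buckets: int = 100) -> str:
    """Serialize segments as 'text XX|YY' lines (assumed format) for a text prompt."""
    lines = []
    for seg in segments:
        x = quantize(seg.x_center, buckets)
        y = quantize(seg.y_center, buckets)
        lines.append(f"{seg.text} {x:02d}|{y:02d}")
    return "\n".join(lines)


def ground(prediction: str, segments: List[Segment],
           buckets: int = 100) -> Optional[Segment]:
    """Return the segment a 'value XX|YY' prediction points to, or None.

    A prediction is rejected when its coordinates are malformed, when no
    segment sits at those coordinates, or when the segment found there does
    not contain the predicted value (a likely hallucination).
    """
    value, _, coords = prediction.rpartition(" ")
    if not value:
        return None
    try:
        x_str, y_str = coords.split("|")
        x, y = int(x_str), int(y_str)
    except ValueError:
        return None
    for seg in segments:
        same_place = (quantize(seg.x_center, buckets),
                      quantize(seg.y_center, buckets)) == (x, y)
        if same_place and value in seg.text:
            return seg
    return None


segments = [
    Segment("Invoice No: 12345", 0.25, 0.125),
    Segment("Total: $98.50", 0.75, 0.875),
]
prompt_body = encode_segments(segments)
```

With these inputs, `ground("12345 25|12", segments)` returns the first segment, while `ground("99999 25|12", segments)` returns `None` because the value does not appear at that location. The returned segment is what would let a caller recover a bounding box for the extracted entity.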

