Jul 2, 2024

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

arXiv:2407.01976v3, 65 citations, h-index: 15, Has Code
Originality: Incremental advance
AI Analysis

This work addresses limitations in document understanding for tasks such as key information extraction (KIE) and visual question answering (VQA) by improving layout-text integration, though it appears incremental because it builds on existing OCR-based LLM approaches.

The paper tackles the problem of integrating spatial layouts with text in document understanding by introducing LayTextLLM, which projects bounding boxes to single embeddings and interleaves them with text, resulting in a 15.2% increase on KIE tasks and 10.7% on VQA tasks compared to previous SOTA methods.

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in KIE and VQA. Comprehensive benchmark evaluations reveal significant improvements of LayTextLLM, with a 15.2% increase on KIE tasks and 10.7% on VQA tasks compared to previous SOTA OCR-based LLMs. All resources are available at https://github.com/LayTextLLM/LayTextLLM.
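The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random linear map stands in for whatever learned projection the model uses, and every name here (`make_projection`, `project_box`, `embed_token`, `interleave`, the sample boxes and tokens) is hypothetical. It only shows that each OCR segment contributes exactly one layout embedding, followed by the embeddings of its text tokens.

```python
import random


def make_projection(dim, seed=0):
    """Random stand-in (assumption) for a learned linear layer: 4 coords -> dim values."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias


def project_box(box, projection):
    """Map one normalized (x1, y1, x2, y2) box to a single embedding vector."""
    weights, bias = projection
    return [sum(w * c for w, c in zip(row, box)) + b
            for row, b in zip(weights, bias)]


def embed_token(token, dim):
    """Hypothetical text embedding: a deterministic pseudo-random vector per token."""
    rng = random.Random(token)
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def interleave(segments, projection, dim):
    """Build [box_1, text_1..., box_2, text_2..., ...] as (kind, vector) pairs."""
    sequence = []
    for box, tokens in segments:
        # One embedding per bounding box, regardless of how many digits
        # its coordinates would take if written out as text.
        sequence.append(("layout", project_box(box, projection)))
        sequence.extend(("text", embed_token(t, dim)) for t in tokens)
    return sequence


DIM = 8
# Made-up OCR output: (normalized bounding box, text tokens in that box).
segments = [
    ((0.10, 0.05, 0.40, 0.08), ["Invoice", "No."]),
    ((0.45, 0.05, 0.60, 0.08), ["12345"]),
]
sequence = interleave(segments, make_projection(DIM), DIM)
kinds = [kind for kind, _ in sequence]
print(kinds)  # ['layout', 'text', 'text', 'layout', 'text']
```

In this sketch the sequence length is the number of text tokens plus one per box, which is the sense in which a bounding box costs a single token; writing the four coordinates out as text would instead add several tokens per box.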

Code Implementations: 1 repo
