CV | CL | May 8, 2025

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

arXiv:2505.05446v1 | 5 citations | h-index: 16 | Has Code | CVPR
Originality: Highly original
AI Analysis

This work addresses limited contextual information and hallucinations in visual document understanding, targeting applications that handle text-rich visual content. It represents a strong, domain-specific gain.

The paper tackles the challenge of integrating visual perception and textual comprehension in Visual Document Understanding by proposing an adaptive markup language generation pipeline, resulting in a model that significantly outperforms existing state-of-the-art MLLMs across benchmarks.

Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.

Code Implementations: 1 repo
