M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
This work addresses a critical gap in evaluating LVLMs on multimodal document understanding, which matters to researchers and developers in AI and document processing. The contribution is incremental: it builds on existing LVLM capabilities, adding a new benchmark and a baseline model.
The paper asks whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in documents. It introduces a new benchmark, M-DocSum-Bench, and finds that leading LVLMs struggle with coherence and accuracy, while the authors' baseline model, M-DocSum-7B, achieves state-of-the-art performance compared with larger models such as GPT-4o and Gemini Pro.
We investigate a critical yet under-explored question in Large Vision-Language Models (LVLMs): Do LVLMs genuinely comprehend interleaved image-text in documents? Existing document understanding benchmarks often assess LVLMs using question-answer formats, which are information-sparse and make it difficult to guarantee coverage of long-range dependencies. To address this issue, we introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench), which comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences. M-DocSum-Bench is a reference-based generation task that requires generating interleaved image-text summaries using provided reference images, thereby simultaneously evaluating capabilities in understanding, reasoning, localization, and summarization within complex multimodal document scenarios. To facilitate this benchmark, we develop an automated framework to construct summaries and propose a fine-grained evaluation method called M-DocEval. We further develop a robust summarization baseline, M-DocSum-7B, through progressive two-stage training with diverse instruction and preference data. Extensive results on M-DocSum-Bench reveal that leading LVLMs struggle to maintain coherence and accurately integrate information within long, interleaved contexts, often confusing similar images and lacking robustness. Notably, M-DocSum-7B achieves state-of-the-art performance compared to larger and closed-source models (including GPT-4o, Gemini Pro, Claude-3.5-Sonnet, and Qwen2.5-VL-72B), demonstrating the potential of LVLMs for improved interleaved image-text understanding. The code, data, and models are available at https://github.com/stepfun-ai/M-DocSum-Bench.
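To make the task format concrete, the following is a minimal sketch of how an interleaved image-text summary might be represented and how image selection could be compared against a reference. The abstract does not specify the data schema or the M-DocEval metrics, so everything here is an assumption for illustration only: the `SummaryParagraph` class, the `image_match_rate` function, and the sample identifiers such as `fig1` are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SummaryParagraph:
    """One paragraph of an interleaved summary (hypothetical schema)."""
    text: str
    image_id: Optional[str] = None  # reference image attached to this paragraph, if any


def image_match_rate(predicted: List[SummaryParagraph],
                     reference: List[SummaryParagraph]) -> float:
    """Fraction of aligned paragraphs whose image choice equals the reference.

    Paragraphs are compared position by position, and "no image" (None)
    counts as a valid choice. This is a toy metric, not M-DocEval.
    """
    if not reference:
        raise ValueError("reference summary must not be empty")
    matches = sum(
        1 for p, r in zip(predicted, reference) if p.image_id == r.image_id
    )
    # Divide by the reference length so missing paragraphs count as misses.
    return matches / len(reference)


# Made-up example data: four paragraphs, each optionally paired with an image.
reference = [
    SummaryParagraph("Motivation and task definition.", "fig1"),
    SummaryParagraph("Benchmark construction.", "fig2"),
    SummaryParagraph("Training recipe.", None),
    SummaryParagraph("Main results.", "tab1"),
]
predicted = [
    SummaryParagraph("Why interleaved summaries matter.", "fig1"),
    SummaryParagraph("How the benchmark is built.", "fig3"),  # picks a similar but wrong figure
    SummaryParagraph("Two-stage training.", None),
    SummaryParagraph("Results overview.", "tab1"),
]

rate = image_match_rate(predicted, reference)
print(rate)
```

In this toy setup, the second predicted paragraph selects a different figure than the reference, which mirrors the kind of confusion between similar images that the abstract reports. A real evaluation would also need to score the text itself, which this sketch deliberately leaves out.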