CL · CV · Apr 30, 2025

GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

Peking U
arXiv:2505.00063v2 · 2 citations · h-index: 33 · Has Code
Originality: Incremental advance
AI Analysis

GDI-Bench gives researchers and developers a tool to systematically assess and improve document AI models, though the contribution is incremental because it builds on existing benchmarking efforts.

The authors address the lack of a comprehensive benchmark for evaluating multimodal large language models in document intelligence by introducing GDI-Bench, which includes 2.3k images across 19 tasks and decouples visual complexity from reasoning complexity. They also propose a GDI-Model that achieves state-of-the-art performance on GDI-Bench and on previous benchmarks.

The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models' capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 2.3k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate various open-source and closed-source models on GDI-Bench, conducting decoupled analyses in the visual and reasoning domains, revealing their strengths and weaknesses. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI-Model that mitigates catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy, thereby reinforcing the inherent weaknesses of the base model. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and models are or will be open-sourced on https://huggingface.co/GDIBench.
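The decoupled, graded assessment described in the abstract can be pictured as a grid of scores indexed by visual difficulty and reasoning difficulty. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields (`visual`, `reasoning`, `score`), the integer difficulty levels, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def grade_by_difficulty(results):
    """Average the scores in each (visual level, reasoning level) cell.

    `results` is a list of dicts with hypothetical keys
    "visual", "reasoning" (difficulty levels) and "score" (0..1).
    """
    cells = defaultdict(list)
    for record in results:
        cells[(record["visual"], record["reasoning"])].append(record["score"])
    return {cell: mean(scores) for cell, scores in cells.items()}


def marginal(grid, axis):
    """Average the grid along one axis: 0 = visual, 1 = reasoning."""
    by_level = defaultdict(list)
    for cell, score in grid.items():
        by_level[cell[axis]].append(score)
    return {level: mean(scores) for level, scores in sorted(by_level.items())}


if __name__ == "__main__":
    # Made-up scores for illustration only; not results from the paper.
    example = [
        {"visual": 0, "reasoning": 0, "score": 1.0},
        {"visual": 0, "reasoning": 1, "score": 0.5},
        {"visual": 1, "reasoning": 0, "score": 0.75},
        {"visual": 1, "reasoning": 1, "score": 0.25},
    ]
    grid = grade_by_difficulty(example)
    print("by visual level:", marginal(grid, axis=0))
    print("by reasoning level:", marginal(grid, axis=1))
```

In this made-up example the score falls more sharply along the reasoning axis than along the visual axis, which is the kind of signal a decoupled benchmark is meant to surface when locating a model's weaknesses.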

