CVCLJul 1, 2024

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Peking UStanford
arXiv:2407.01523v3136 citationsh-index: 63
Originality Incremental advance
AI Analysis

This work addresses the need for better evaluation of long-context document understanding in AI models, which is crucial for applications like document analysis, but it is incremental as it builds on existing benchmarking efforts.

The paper tackles the problem of evaluating long-context document understanding in large vision-language models by introducing MMLongBench-Doc, a benchmark with 1,062 expert-annotated questions based on lengthy PDF documents, and finds that current models struggle, with the best-performing GPT-4o achieving only a 42.7% F1 score.

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple pages. 22.8% of the questions are designed to be unanswerable for detecting potential hallucinations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing model, GPT-4o, achieves an F1 score of only 42.7%, while the second-best, GPT-4V, scores 31.4%. Furthermore, 12 LVLMs (all except GPT-4o and GPT-4V) even present worse performance than their LLM counterparts which are fed with lossy-parsed OCR documents. These results validate the necessity of future research toward more capable long-context LVLMs. Project Page: https://mayubo2333.github.io/MMLongBench-Doc

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes