AI · Apr 23

Can MLLMs "Read" What is Missing?

arXiv:2604.21277 · 65.5

Predicted impact: top 57% in AI · last 90 days

Originality: Incremental advance

AI Analysis

For researchers evaluating MLLMs, this benchmark provides a more direct assessment of layout understanding and visual grounding by isolating reconstruction from instruction-following.

The paper introduces MMTR-Bench, a benchmark for evaluating MLLMs' ability to reconstruct masked text from visual context without explicit prompts. Experiments show it poses a significant challenge, especially for sentence- and paragraph-level reconstruction.

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.
