LG, AI, CL, CV | May 22, 2025

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

arXiv:2505.17163v1 | 20 citations | h-index: 20 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a critical gap for researchers and developers in AI by providing a benchmark to assess MLLMs' reasoning capabilities in text-rich visual scenarios, though it is incremental as it builds on existing multimodal systems.

The authors tackled the lack of a systematic benchmark for evaluating multimodal large language models (MLLMs) on text-rich image reasoning tasks by proposing OCR-Reasoning, a comprehensive benchmark with 1,069 examples across 6 reasoning abilities and 18 tasks. Their evaluation revealed that even state-of-the-art MLLMs struggle significantly, with none achieving accuracy above 50% on this benchmark.

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
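As a rough illustration of how final-answer accuracy on a benchmark of this shape might be tallied, the sketch below computes overall and per-ability accuracy. It is not the authors' evaluation script: the record fields (`id`, `ability`, `answer`), the ability names, and the exact-match scoring rule are all assumptions made for this example, and it does not attempt to score the annotated reasoning process.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy_by_ability(examples, predictions):
    """Return (overall accuracy, per-ability accuracy) using exact match.

    `examples` is a list of dicts with hypothetical keys "id", "ability", "answer".
    `predictions` maps example id -> model answer string; missing ids count as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        ability = ex["ability"]
        total[ability] += 1
        pred = predictions.get(ex["id"], "")
        if normalize(pred) == normalize(ex["answer"]):
            correct[ability] += 1
    per_ability = {a: correct[a] / total[a] for a in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_ability


# Toy data; the ability labels here are placeholders, not the benchmark's categories.
examples = [
    {"id": "q1", "ability": "numerical", "answer": "42"},
    {"id": "q2", "ability": "numerical", "answer": "3.5 kg"},
    {"id": "q3", "ability": "logical", "answer": "Tuesday"},
]
predictions = {"q1": "42", "q2": "4 kg", "q3": " tuesday "}

overall, per_ability = accuracy_by_ability(examples, predictions)
print(overall, per_ability)
```

Exact match is only a stand-in here; a real evaluation of free-form answers and reasoning traces would likely need a more tolerant judge, so the repository linked above remains the reference for actual scoring.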

