GR AI CL CV · Mar 13, 2025

Towards Understanding Graphical Perception in Large Multimodal Models

Kai Zhang, Jianwei Yang, Jeevana Priya Inala, Chandan Singh, Jianfeng Gao, Yu Su, Chenglong Wang

Microsoft

arXiv:2503.10857v1 · 4.35 citations · h-index: 42 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a gap in how the perception abilities of LMMs are evaluated, giving researchers and developers a diagnostic tool to guide improvements.

The paper tackles the problem that large multimodal models (LMMs) struggle with simple perception tasks on infographics, despite excelling at complex vision-language tasks. Using an evaluation framework based on graphical perception theory, the authors find critical limitations in state-of-the-art LMMs such as GPT-4o, including an inability to generalize across chart types, understand fundamental visual elements, and cross-reference values within a chart.

Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly found that these models struggle with simple tasks on infographics that require perception only. As existing benchmarks primarily focus on end tasks that require various abilities, they provide only limited fine-grained insight into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross-reference values within a chart. These insights provide guidance for future improvements in perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.
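To make "automated task generation and response evaluation" concrete, here is a minimal Python sketch of what one perception-only task and its scoring could look like. It is an illustration under stated assumptions, not the authors' implementation: the names `make_task` and `score_response`, the bar-chart value-reading question, and the 5% relative-error tolerance are all hypothetical, and the released repository may be organized quite differently.

```python
import random
import re


def make_task(rng, n_bars=5):
    """Build one synthetic bar-chart spec plus a perception-only question.

    Hypothetical helper: the chart is described as data only; rendering it
    to an image for the model is left out of this sketch.
    """
    labels = [chr(ord("A") + i) for i in range(n_bars)]
    values = [rng.randint(10, 100) for _ in labels]
    target = rng.choice(labels)
    return {
        "chart_type": "bar",
        "data": dict(zip(labels, values)),
        "question": f"What is the value of bar {target}?",
        "answer": float(values[labels.index(target)]),
    }


def score_response(response, answer, tolerance=0.05):
    """Return True if the first number in the response is within a relative
    tolerance of the ground-truth answer (assumed scoring rule)."""
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    if match is None:
        return False  # no numeric answer could be extracted
    predicted = float(match.group())
    return abs(predicted - answer) <= tolerance * abs(answer)


# Seeded generator so the same task set can be regenerated for every model.
rng = random.Random(0)
task = make_task(rng)
```

Because the ground truth comes from the generated data rather than from human labels, the same generator could in principle be pointed at other chart types or visual elements while holding the underlying values fixed, which is the kind of controlled comparison the abstract describes.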

