CL · Nov 5, 2025

MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

arXiv:2511.03146v1 · 2 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This work addresses the insufficient assessment of cognitive capacity in multimodal AI, a gap relevant to researchers and developers. It is an incremental advance that builds on existing benchmarking efforts.

The authors address the lack of vision-centric cognitive evaluation for multimodal large language models (MLLMs) by introducing MME-CC, a benchmark of 11 tasks spanning spatial, geometric, and knowledge-based reasoning. Their experiments show that closed-source models outperform open-source ones (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V) and that spatial and geometric reasoning scores remain low (≤30%).

As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
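The category-level reporting described in the abstract can be illustrated with a minimal sketch. It assumes that per-task scores on a 0-100 scale are averaged within each of the three categories; the abstract does not state the actual aggregation rule. The task names, the `CATEGORY_OF` mapping, both helper functions, and all numbers below are hypothetical and are not taken from the paper.

```python
from statistics import mean

# Hypothetical task-to-category mapping. The real benchmark has 11 tasks,
# whose names are not listed in the abstract.
CATEGORY_OF = {
    "spatial_task_1": "spatial",
    "spatial_task_2": "spatial",
    "geometric_task_1": "geometric",
    "knowledge_task_1": "knowledge",
}


def category_scores(task_scores):
    """Average per-task scores (0-100) within each category (assumed rule)."""
    grouped = {}
    for task, score in task_scores.items():
        grouped.setdefault(CATEGORY_OF[task], []).append(score)
    return {cat: mean(scores) for cat, scores in grouped.items()}


def weak_categories(scores, threshold=30.0):
    """Return, in sorted order, categories whose average is <= threshold."""
    return sorted(cat for cat, s in scores.items() if s <= threshold)


# Made-up numbers purely for illustration; these are not results from the paper.
example_task_scores = {
    "spatial_task_1": 20.0,
    "spatial_task_2": 30.0,
    "geometric_task_1": 28.0,
    "knowledge_task_1": 60.0,
}

if __name__ == "__main__":
    per_category = category_scores(example_task_scores)
    print(per_category)
    print(weak_categories(per_category))
```

The 30-point threshold mirrors the "less than or equal to 30%" figure quoted in the abstract; whether the paper applies it per task or per category is not stated there.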

