AI | Mar 30

CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin

arXiv:2603.27958 | 58.6 | h-index: 3

AI Analysis

The benchmark addresses a critical gap in assessing higher-order intelligence in MLLMs, which matters to researchers and developers, though the contribution is incremental because it builds on existing analogical reasoning evaluations.

The paper evaluates compositional analogical reasoning in multimodal large language models (MLLMs) by introducing CARV, a diagnostic benchmark with a 5,500-sample dataset, and finds that even a state-of-the-art model such as Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%.

Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation of state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: models struggle with (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.
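To make the "extract a rule from each pair, then compose" idea concrete, here is a minimal Python sketch. It is an illustration only, not the CARV data format or the authors' method: it assumes each image has already been reduced to a symbolic attribute dictionary, and the names `extract_rule`, `compose`, `apply_rule`, and the attributes `color`, `shape`, and `size` are all hypothetical.

```python
from typing import Dict

# Assumption: an image is summarized as attribute -> value (not the CARV format).
Scene = Dict[str, str]
# Assumption: a rule maps each changed attribute to its new value.
Rule = Dict[str, str]


def extract_rule(before: Scene, after: Scene) -> Rule:
    """Return the attributes whose value differs between the two scenes."""
    return {attr: value for attr, value in after.items() if before.get(attr) != value}


def compose(*rules: Rule) -> Rule:
    """Merge rules from several pairs; reject contradictory rules."""
    merged: Rule = {}
    for rule in rules:
        for attr, value in rule.items():
            if attr in merged and merged[attr] != value:
                raise ValueError(f"conflicting rules for {attr!r}")
            merged[attr] = value
    return merged


def apply_rule(scene: Scene, rule: Rule) -> Scene:
    """Apply a rule to a scene, leaving untouched attributes as they are."""
    return {**scene, **rule}


# Two source pairs, each demonstrating one change.
pair_1 = (
    {"color": "red", "shape": "circle", "size": "small"},
    {"color": "blue", "shape": "circle", "size": "small"},
)
pair_2 = (
    {"color": "green", "shape": "square", "size": "small"},
    {"color": "green", "shape": "square", "size": "large"},
)

composed = compose(extract_rule(*pair_1), extract_rule(*pair_2))

# The query must receive both changes at once.
query = {"color": "red", "shape": "triangle", "size": "small"}
predicted = apply_rule(query, composed)
print(predicted)  # {'color': 'blue', 'shape': 'triangle', 'size': 'large'}
```

In this toy setting the composition is trivial because the scenes are already symbolic. The first failure mode reported above corresponds to the step the sketch skips: getting from raw visual changes to rules of this kind.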
