CV · Apr 2

Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

arXiv:2604.01764 · 12.0 · h-index: 20

Predicted impact: top 68% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This paper addresses a critical gap in how AI systems are evaluated for neurosymbolic reasoning, though the advance is incremental: it benchmarks the problem rather than solving it.

The paper introduces RebusBench, a benchmark of 1,164 rebus puzzles for evaluating cognitive visual reasoning in Large Vision-Language Models (LVLMs). State-of-the-art models perform poorly, staying below 10% Exact Match and below 20% semantic accuracy, with no significant improvement from model scaling or In-Context Learning.

Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
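As an illustration of the Exact Match metric mentioned in the abstract, the following is a minimal sketch of how such a score might be computed. The normalization steps (lowercasing, stripping punctuation, collapsing whitespace) and the function names `normalize`, `exact_match`, and `exact_match_rate` are assumptions made for this example; the abstract does not specify the paper's scoring procedure, and the semantic accuracy metric is not sketched here because its definition is not given.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace.

    Assumed normalization for illustration only; the benchmark's own
    rules may differ.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Under this assumed normalization, a model answer of "Once in a Blue Moon!" would count as a match for the reference "once in a blue moon", while a paraphrase such as "very rarely" would not, which is why a separate semantic metric is useful alongside Exact Match.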

