CV · Oct 27, 2025

PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan

arXiv:2510.23594v2 · 11.84 citations · h-index: 20

Originality: Incremental advance

AI Analysis

This work addresses the need for diagnostic evaluation protocols that improve reasoning reliability in MLLMs, which is crucial for developing trustworthy AI systems. The advance is incremental, since it builds on existing benchmark and error-detection concepts.

The authors tackle unreliable reasoning in multimodal large language models (MLLMs) by introducing PRISM-Bench, a benchmark of puzzle-based visual tasks that requires models to detect errors in step-by-step reasoning chains. The results reveal a persistent gap between fluent generation and faithful reasoning in state-of-the-art models.

Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
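The diagnostic task described in the abstract (find the first incorrect step in a CoT that contains exactly one error) can be illustrated with a minimal scoring sketch. Everything here is an assumption made for illustration: the `ErrorDetectionItem` record, its field names, the 1-based step indexing, and the use of exact-match accuracy are hypothetical and are not taken from the paper's data format or its reported metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ErrorDetectionItem:
    """Hypothetical record for one item: a CoT with exactly one injected error."""
    steps: List[str]        # step-by-step reasoning shown to the model
    first_error_step: int   # assumed 1-based index of the first incorrect step


def first_error_accuracy(items: List[ErrorDetectionItem],
                         predictions: List[int]) -> float:
    """Fraction of items whose predicted step index equals the labeled one.

    Exact-match accuracy is an assumed metric, chosen here for simplicity.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if pred == item.first_error_step
    )
    return correct / len(items)


# Toy data, invented for this sketch (not drawn from PRISM-Bench).
items = [
    ErrorDetectionItem(
        steps=["count sides", "compare shapes", "pick option B"],
        first_error_step=2,
    ),
    ErrorDetectionItem(
        steps=["read grid", "apply rotation", "pick option A"],
        first_error_step=1,
    ),
]
predictions = [2, 3]  # hypothetical model outputs, one step index per item

print(first_error_accuracy(items, predictions))
```

Scoring the predicted step index separately from the final answer is what lets a protocol of this kind distinguish a model that solves the puzzle from one that can verify a given reasoning chain.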
