CV | CL | LG | Dec 13, 2025

Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

arXiv:2512.12218v2 | 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses the reliability of multimodal reasoning in AI systems. The advance is incremental, since it builds on existing evaluation and self-reflection methods.

The paper tackles visual unfaithfulness in the reasoning chains generated by vision-language models, where a model may reach a correct answer through intermediate steps that are not grounded in the image. The authors introduce a training-free framework to evaluate and improve visual faithfulness, reducing the Unfaithful Perception Rate while maintaining final-answer accuracy across multiple benchmarks.

Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
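
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give a formula for the Unfaithful Perception Rate, so the code assumes it is the fraction of perception steps that a judge marks as not grounded in the image. The `Step` type, the `judge` and `regenerate` callables, the retry limit, and the toy data are all hypothetical stand-ins; in practice the judge and the regenerator would be calls to a VLM that also receives the image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    text: str
    kind: str  # "perception" or "reasoning" (assumed labels)


# Hypothetical interfaces: a real judge would be a VLM that sees the image.
Judge = Callable[[str], bool]        # True if the step is grounded in the image
Regenerator = Callable[[str], str]   # proposes a replacement for one step


def unfaithful_perception_rate(steps: List[Step], judge: Judge) -> float:
    """Assumed definition: share of perception steps judged unfaithful."""
    perception = [s for s in steps if s.kind == "perception"]
    if not perception:
        return 0.0
    unfaithful = sum(1 for s in perception if not judge(s.text))
    return unfaithful / len(perception)


def self_reflect(steps: List[Step], judge: Judge,
                 regenerate: Regenerator, max_attempts: int = 2) -> List[Step]:
    """Locally regenerate perception steps the judge rejects; no training."""
    repaired = []
    for step in steps:
        text = step.text
        if step.kind == "perception":
            attempts = 0
            # Retry only the failing step, leaving the rest of the chain intact.
            while not judge(text) and attempts < max_attempts:
                text = regenerate(text)
                attempts += 1
        repaired.append(Step(text, step.kind))
    return repaired


# Toy stand-ins for the image-grounded judge and the regenerating model.
GROUNDED = {"The sign is red.", "There are two cars."}


def toy_judge(text: str) -> bool:
    return text in GROUNDED


def toy_regenerate(text: str) -> str:
    return "There are two cars."


chain = [
    Step("The sign is red.", "perception"),
    Step("There are three cars.", "perception"),
    Step("Red signs usually mean stop.", "reasoning"),
    Step("So the cars must stop.", "reasoning"),
]

before = unfaithful_perception_rate(chain, toy_judge)
after = unfaithful_perception_rate(
    self_reflect(chain, toy_judge, toy_regenerate), toy_judge
)
print(before, after)
```

Only perception steps are judged and rewritten in this sketch, which mirrors the abstract's focus on whether perception steps are grounded in the image. Reasoning steps pass through unchanged.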

