Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
This work addresses a reliability issue in multimodal reasoning for AI systems, though the contribution is incremental, as it builds on existing evaluation and self-reflection methods.
The paper tackles visual unfaithfulness in reasoning chains generated by vision-language models, where a model may reach a correct answer through intermediate steps that are not grounded in the image. The authors introduce a training-free framework to evaluate and improve visual faithfulness, reducing the Unfaithful Perception Rate while maintaining final-answer accuracy across multiple benchmarks.
Reasoning-augmented vision-language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, and we verify this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces the Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
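The evaluate-then-repair loop described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the `Step` type, the `judge` and `regenerate` callables (stand-ins for a VLM judge and a VLM regeneration call), and the `max_retries` parameter. The abstract does not define the Unfaithful Perception Rate precisely, so the sketch assumes it is the fraction of perception steps that the judge marks as not grounded in the image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    text: str
    kind: str  # "perception" or "reasoning" (assumed labels)


# Hypothetical interfaces: a real system would call a VLM with the image.
Judge = Callable[[Step], bool]                   # True if the step is grounded
Regenerate = Callable[[List[Step], int], Step]   # rewrite the step at an index


def unfaithful_perception_rate(chains: List[List[Step]], judge: Judge) -> float:
    """Assumed definition: share of perception steps judged unfaithful."""
    perception = [s for chain in chains for s in chain if s.kind == "perception"]
    if not perception:
        return 0.0
    unfaithful = sum(1 for s in perception if not judge(s))
    return unfaithful / len(perception)


def self_reflect(chain: List[Step], judge: Judge, regenerate: Regenerate,
                 max_retries: int = 2) -> List[Step]:
    """Regenerate only the perception steps the judge flags; keep the rest."""
    repaired = list(chain)
    for i, step in enumerate(repaired):
        if step.kind != "perception":
            continue  # reasoning steps are left untouched
        retries = 0
        while not judge(repaired[i]) and retries < max_retries:
            repaired[i] = regenerate(repaired, i)
            retries += 1
    return repaired


# Toy stand-ins: pretend the image shows a blue car, so "red" is unfaithful.
def toy_judge(step: Step) -> bool:
    return "red" not in step.text


def toy_regenerate(chain: List[Step], index: int) -> Step:
    return Step("The car is blue.", "perception")


example_chain = [
    Step("The car is red.", "perception"),
    Step("There are two people.", "perception"),
    Step("So the answer is B.", "reasoning"),
]
rate_before = unfaithful_perception_rate([example_chain], toy_judge)
repaired_chain = self_reflect(example_chain, toy_judge, toy_regenerate)
rate_after = unfaithful_perception_rate([repaired_chain], toy_judge)
```

In this toy run one of the two perception steps is flagged, so `rate_before` is 0.5, and after local regeneration `rate_after` is 0.0 while the reasoning step is unchanged. Repairing only flagged steps, rather than regenerating the whole chain, mirrors the "locally regenerates" wording in the abstract; how the actual method conditions regeneration on the image and judge feedback is not specified there.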