TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
TRACE addresses silent reasoning errors in vision-language models, giving researchers and developers a tool for debugging and model improvement. The contribution is incremental in that it builds on existing evaluation methods.
The paper tackles the problem of unreliable mathematical and scientific reasoning in large vision-language models by introducing TRACE, a framework that diagnoses reasoning trajectories through consistency-based metrics, showing that consistency correlates with final-answer correctness and helps pinpoint failure steps.
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sets of sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
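As a rough illustration of the idea, and not the paper's actual metric, the sketch below scores one Auxiliary Reasoning Set by the fraction of sub-questions answered in agreement with a reference, reports the first step that disagrees, and maps the score to a confidence region. The names `SubQA`, `consistency`, `first_failure`, and `confidence_region`, the exact-match comparison, and the thresholds 0.5 and 0.8 are all assumptions made for this example; the abstract does not specify how consistency or the regions are computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubQA:
    """One sub-question in a hypothetical Auxiliary Reasoning Set."""
    question: str
    reference: str  # expected answer to the sub-question
    predicted: str  # the model's answer to the sub-question


def _normalize(answer: str) -> str:
    # Assumption: answers are compared by case-insensitive exact match.
    return answer.strip().lower()


def consistency(ars: List[SubQA]) -> float:
    """Fraction of sub-questions whose prediction matches the reference."""
    if not ars:
        raise ValueError("ARS must contain at least one sub-question")
    hits = sum(_normalize(s.predicted) == _normalize(s.reference) for s in ars)
    return hits / len(ars)


def first_failure(ars: List[SubQA]) -> Optional[int]:
    """Index of the first mismatching step, or None if every step agrees."""
    for i, s in enumerate(ars):
        if _normalize(s.predicted) != _normalize(s.reference):
            return i
    return None


def confidence_region(score: float, low: float = 0.5, high: float = 0.8) -> str:
    """Bucket a consistency score; the thresholds are illustrative only."""
    if score >= high:
        return "reliable"
    if score < low:
        return "unreliable"
    return "uncertain"


# Made-up example: three sub-questions for a triangle-area problem.
example = [
    SubQA("What is the base of the triangle?", "6", "6"),
    SubQA("What is the height of the triangle?", "4", "5"),
    SubQA("Which formula gives the area?", "base * height / 2", "base * height / 2"),
]

score = consistency(example)
print(f"consistency={score:.2f}",
      f"region={confidence_region(score)}",
      f"first_failure={first_failure(example)}")
```

In this made-up case the model misreads the height, so the failure is localized to the second sub-question even if only the final area would normally be checked. A score-to-region mapping like this could be used to filter out low-consistency trajectories before further analysis.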