CV | Dec 4, 2025

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

arXiv:2512.05091v1 | 5 citations | h-index: 15
Originality: Incremental advance
AI Analysis

This work targets AI researchers concerned with the lack of interpretability in visual reasoning. It is incremental: it builds on existing MLLM capabilities, adding a new benchmark and dataset.

The paper tackles the opaque reasoning of Multimodal Large Language Models (MLLMs) by introducing the Visual Reasoning Tracer (VRT) task, which requires models to localize the target object and predict the intermediate objects along the reasoning path. It reports that models trained on the authors' VRT-80k dataset achieve substantial improvements in tracing these paths.

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
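To make the task concrete, here is a minimal sketch of how a reasoning trace could be represented and scored. The abstract does not define the paper's metric, so everything here is an assumption for illustration: the `Box` type, the position-wise IoU matching in `trace_score`, and the 0.5 threshold are hypothetical and are not the authors' method. The paper also mentions pixel-level evidence; boxes are used only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box (hypothetical stand-in for an object's location)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Clamp each side at zero so an empty intersection has zero area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1),
                min(a.x2, b.x2), min(a.y2, b.y2)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def trace_score(pred: list[Box], gold: list[Box], thresh: float = 0.5) -> float:
    """Fraction of gold steps whose same-position predicted box has IoU >= thresh.

    Illustrative scoring rule only, not the metric proposed in the paper.
    """
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(pred, gold) if iou(p, g) >= thresh)
    return hits / len(gold)


def final_correct(pred: list[Box], gold: list[Box], thresh: float = 0.5) -> bool:
    """True if the last predicted box matches the gold target object."""
    return bool(pred) and bool(gold) and iou(pred[-1], gold[-1]) >= thresh


# A trace is an ordered list: intermediate objects first, target object last.
gold_trace = [Box(0, 0, 10, 10), Box(20, 20, 30, 30)]
# This prediction finds the target but grounds the intermediate step wrongly.
pred_trace = [Box(50, 50, 60, 60), Box(20, 20, 30, 30)]

score = trace_score(pred_trace, gold_trace)
target_ok = final_correct(pred_trace, gold_trace)
```

Under these assumptions, `pred_trace` gets the final object right while scoring only half of the trace, which is the failure pattern the abstract attributes to existing models.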

