CV | AI | Mar 21

Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning

arXiv:2603.26737 | 54.6 | h-index: 2
AI Analysis

This paper addresses the limitation of static visual encoding in multimodal LLMs for visual reasoning tasks, though it appears incremental: it builds on existing CoT methods by adding a structured visual component.

The paper tackles the problem of multimodal LLMs lacking goal-driven and adaptive visual access by proposing Structural Sequential Visual CoT (SSV-CoT), which uses saliency maps to organize visual regions and performs reasoning in a curriculum-like order. Experiments on diverse visual reasoning benchmarks show gains, validating the approach.

Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. The method is trained end-to-end using text CoT and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
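To make the two steps concrete, here is a minimal sketch of the ordering idea only: given a saliency map, rank image regions from most to least salient so they can be visited in that order. It is not the authors' implementation. The abstract does not say how regions are defined or how the saliency map is produced, so the fixed grid partition, the mean-saliency score, and the name `order_regions_by_saliency` are all assumptions made for illustration.

```python
import numpy as np


def order_regions_by_saliency(saliency, grid=(2, 2)):
    """Rank grid cells of a 2-D saliency map from most to least salient.

    Hypothetical helper: the grid partition and mean-pooling score are
    illustrative assumptions, not details taken from the paper.
    Returns a list of (row, col, mean_saliency) tuples.
    """
    saliency = np.asarray(saliency, dtype=float)
    h, w = saliency.shape
    rows, cols = grid
    if h % rows or w % cols:
        raise ValueError("saliency map must divide evenly into the grid")
    ch, cw = h // rows, w // cols
    # View the map as (rows, ch, cols, cw) and average inside each cell.
    cell_scores = saliency.reshape(rows, ch, cols, cw).mean(axis=(1, 3))
    flat = cell_scores.ravel()
    # Stable sort on negated scores: highest first, ties keep row-major order.
    order = np.argsort(-flat, kind="stable")
    return [(int(i // cols), int(i % cols), float(flat[i])) for i in order]


if __name__ == "__main__":
    # Toy saliency map: the top-left quadrant is the primary cue.
    toy = np.array([
        [0.9, 0.9, 0.1, 0.1],
        [0.9, 0.9, 0.1, 0.1],
        [0.2, 0.2, 0.5, 0.5],
        [0.2, 0.2, 0.5, 0.5],
    ])
    for row, col, _ in order_regions_by_saliency(toy):
        print(row, col)
```

On the toy map the cells are visited in the order (0, 0), (1, 1), (1, 0), (0, 1), that is, from the primary cue to progressively weaker ones. In a full system the saliency map would be conditioned on the question and the ordered regions would feed the reasoning steps; both parts are outside the scope of this sketch.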

