ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
ViThinker addresses a bottleneck in vision-language reasoning, the passive handling of visual input, and offers an incremental advance over existing methods.
Vision-language models struggle with Chain-of-Thought reasoning because they process visual information passively. The paper introduces ViThinker, which lets the model actively query visual features, and reports consistent improvements in perceptual grounding and reasoning accuracy across benchmarks.
Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., they process pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens that trigger the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training and performs generative mental simulation during inference without external tool calls. Through a two-stage curriculum, first distilling frozen experts into model parameters and then learning task-driven querying via sparsity penalties, ViThinker discovers minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
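The two-stage curriculum described above can be illustrated with a minimal sketch. The abstract does not give the actual loss functions, so everything here is an assumption made for illustration: the function names, the use of mean squared error for distillation, the penalty on the expected number of queries, and the `sparsity_weight` value are all hypothetical and are not the authors' formulation.

```python
def stage1_distill_loss(student_feats, expert_feats):
    """Stage 1 (assumed form): mean squared error between the features the
    model synthesizes and the features produced by a frozen vision expert."""
    if len(student_feats) != len(expert_feats):
        raise ValueError("feature vectors must have the same length")
    n = len(student_feats)
    return sum((s - e) ** 2 for s, e in zip(student_feats, expert_feats)) / n


def stage2_loss(task_nll, query_probs, sparsity_weight=0.1):
    """Stage 2 (assumed form): task loss plus a sparsity penalty.

    query_probs holds, for each reasoning step, the model's probability of
    emitting a query token. Their sum is the expected number of queries, so
    penalizing it pushes the model to query only when the task needs it.
    """
    expected_queries = sum(query_probs)
    return task_nll + sparsity_weight * expected_queries


# Hypothetical numbers, chosen only to show how the penalty behaves.
distill = stage1_distill_loss([1.0, 2.0], [1.0, 4.0])
frequent = stage2_loss(task_nll=1.0, query_probs=[0.9, 0.9, 0.9])
sparse = stage2_loss(task_nll=1.0, query_probs=[0.9, 0.1, 0.1])
print(distill, frequent, sparse)
```

With the task loss held equal, the configuration that queries less often receives the lower total loss. That is the sense in which a sparsity penalty could encourage "minimal sufficient perception"; the real objective may differ.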