Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

arXiv:2605.11651 · Has Code


AI Analysis

The paper addresses the computational cost of large think-answer VLMs by distilling their reasoning capabilities into compact models while improving visual grounding during reasoning.

The paper introduces a think-answer distillation framework for VLMs that masks salient reasoning prefixes to force the student model to rely more on visual evidence, achieving state-of-the-art results on multimodal reasoning benchmarks.

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for the masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which selectively masks high-influence reasoning prefixes for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation methods, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization throughout the student's thinking process.
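
The two masking strategies described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`kl_divergence`, `masking_budget`, `salient_prefix_mask`), the `saliency` matrix, and the parameters `max_ratio` and `kl_ref` are hypothetical. The sketch assumes that some per-token influence score is available (for example, attention weights), that teacher-student discrepancy is measured with KL divergence, and that the masking ratio grows as the discrepancy shrinks. The abstract does not specify the saliency measure, the discrepancy measure, or the exact schedule.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def masking_budget(kl, max_ratio=0.5, kl_ref=1.0):
    """Hypothetical self-paced schedule (an assumption, not from the paper).

    Treats a large teacher-student KL as a hard example and masks little;
    the ratio rises toward max_ratio as the KL falls toward zero.
    """
    difficulty = min(max(kl / kl_ref, 0.0), 1.0)
    return max_ratio * (1.0 - difficulty)


def salient_prefix_mask(saliency, think_start, think_end, ratio):
    """Build a boolean attention mask; allowed[i][j] is True if i may attend to j.

    saliency[i][j] is an assumed influence score of prefix token j on the
    prediction at position i. Positions in [think_start, think_end) are
    reasoning tokens; all other positions (e.g. visual tokens) are never
    masked beyond the usual causal constraint.
    """
    n = len(saliency)
    # Start from the standard causal mask: no attention to future tokens.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        # Candidate prefixes: reasoning tokens strictly before position i.
        candidates = list(range(think_start, min(i, think_end)))
        k = int(ratio * len(candidates))
        # Hide the k most influential reasoning prefixes for this position.
        top = sorted(candidates, key=lambda j: saliency[i][j], reverse=True)[:k]
        for j in top:
            allowed[i][j] = False
    return allowed
```

In this sketch, a training step would compute the KL between teacher and student next-token distributions, convert it to a ratio with `masking_budget`, and pass the resulting mask to the student's attention layers in place of the plain causal mask. Because visual token positions lie outside the reasoning span, they stay visible, so the student can only recover the hidden textual cues from the image.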
