CV · May 3

Act2See: Emergent Active Visual Perception for Video Reasoning

arXiv:2605.01657 · 99.0
Predicted impact: top 2% in CV · last 90 days
Originality: Highly original
AI Analysis

For video reasoning tasks, Act2See replaces reliance on static initial frames with active visual perception, letting the model bring in visual evidence as its reasoning evolves.

Act2See enables VLMs to actively interleave video frames within text CoTs, achieving state-of-the-art results on VideoEspresso and ViTIB and outperforming comparable or larger models on Video-MME, EgoNormia, and VCR-Bench.

Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.
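
The inference-time behaviour described in the abstract, where the model decides during reasoning whether to retrieve an existing frame or synthesize a new one, can be pictured as a simple dispatch loop. The Python sketch below is illustrative only and is not the authors' implementation: `Frame`, `Step`, `run_interleaved_cot`, the step kinds, and the scripted policy standing in for the fine-tuned VLM are all hypothetical names, and the example question and thoughts are made up.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Frame:
    """A visual item placed in the reasoning trace (hypothetical type)."""
    source: str  # "retrieved" (existing video frame) or "generated" (synthesized)
    ref: str     # the query or prompt that produced it


@dataclass
class Step:
    """One model output: a thought, a frame call, or the final answer."""
    kind: str     # "think" | "retrieve" | "generate" | "answer"
    content: str


TraceItem = Union[str, Frame]


def run_interleaved_cot(
    policy: Callable[[List[TraceItem]], Step],
    retrieve: Callable[[str], Frame],
    generate: Callable[[str], Frame],
    question: str,
    max_steps: int = 8,
) -> List[TraceItem]:
    """Build a trace in which text thoughts and frames are interleaved."""
    trace: List[TraceItem] = [question]
    for _ in range(max_steps):
        step = policy(trace)  # the model chooses what to do next
        if step.kind == "think":
            trace.append(step.content)
        elif step.kind == "retrieve":
            trace.append(retrieve(step.content))   # look up an existing frame
        elif step.kind == "generate":
            trace.append(generate(step.content))   # synthesize a new frame
        elif step.kind == "answer":
            trace.append(step.content)
            break
        else:
            raise ValueError(f"unknown step kind: {step.kind}")
    return trace


# Scripted stand-in for the model (assumption: a real system would query a VLM).
SCRIPT = [
    Step("think", "I need to see what happens after the pour."),
    Step("retrieve", "moment after the pour"),
    Step("think", "What if the cup had been empty?"),
    Step("generate", "same scene with an empty cup"),
    Step("answer", "The cup overflows because it was already full."),
]


def scripted_policy(trace: List[TraceItem]) -> Step:
    # Every step appends exactly one item, so len(trace) - 1 steps have run.
    return SCRIPT[len(trace) - 1]


def demo() -> List[TraceItem]:
    return run_interleaved_cot(
        scripted_policy,
        retrieve=lambda query: Frame("retrieved", query),
        generate=lambda prompt: Frame("generated", prompt),
        question="Why does the cup overflow?",
    )


if __name__ == "__main__":
    for item in demo():
        print(item)
```

In this sketch the retrieval and generation back ends are passed in as plain callables, so the loop itself stays independent of any particular frame search or image synthesis component.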

