CV CL MM Dec 12, 2025

HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

arXiv:2512.11534v1
1 citation
h-index: 5
Originality: Highly original
AI Analysis

This work targets the computational inefficiency of video reasoning in AI systems. It is assessed as a novel method rather than an incremental improvement.

The paper tackles inefficient key frame selection in video understanding. It proposes an end-to-end trainable framework that optimizes the selected frames as a holistic set rather than scoring each frame independently, which counters temporal clustering and visual redundancy. The authors report that the method significantly outperforms existing approaches on Video-MME, LongVideoBench, MLVU, and NExT-QA.
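
To make the temporal-clustering problem concrete, here is a minimal toy sketch. It is not the paper's method: the greedy selector, the `penalty` and `window` parameters, and the scores are all hypothetical, chosen only to show how independent top-K scoring can pick adjacent frames while a set-aware choice spreads out.

```python
def top_k(scores, k):
    """Pick the k highest-scoring frame indices, scoring each frame independently."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def greedy_set(scores, k, penalty=0.5, window=2):
    """Pick k frames one at a time, penalizing frames near already chosen ones.

    Illustrative stand-in for set-level selection (an assumption, not the
    paper's differentiable objective).
    """
    chosen = []
    for _ in range(k):
        best, best_val = None, float("-inf")
        for i, s in enumerate(scores):
            if i in chosen:
                continue
            # Hypothetical redundancy proxy: chosen frames within `window` steps.
            near = sum(1 for j in chosen if abs(i - j) <= window)
            val = s - penalty * near
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return sorted(chosen)


# Made-up per-frame relevance scores with one high-scoring cluster (frames 2-4).
scores = [0.5, 0.2, 0.9, 0.95, 0.92, 0.3, 0.2, 0.6, 0.1, 0.2, 0.1, 0.55]

print(top_k(scores, 3))       # [2, 3, 4]: three adjacent frames
print(greedy_set(scores, 3))  # [3, 7, 11]: spread across the clip
```

With these toy scores, independent scoring spends the whole budget on one moment of the video, while the set-aware variant trades a little per-frame score for temporal coverage.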

Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
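
The abstract names three ingredients: a continuous set-level objective over relevance, coverage, and redundancy; a Gumbel-Softmax relaxation; and a KL term aligning the student and teacher distributions. The sketch below shows one possible shape for each in plain Python. The exact functional forms, the weights `lam_cov` and `lam_red`, the similarity matrix, and the KL direction are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def softmax(xs, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def gumbel_softmax(logits, tau=0.5, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        noisy.append(logit - math.log(-math.log(u)))
    return softmax(noisy, tau)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over frames."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def set_objective(w, relevance, sim, lam_cov=1.0, lam_red=1.0):
    """Score soft selection weights w in [0, 1] as a set (assumed form).

    relevance: reward for selecting query-relevant frames.
    coverage:  each frame should resemble some selected frame.
    redundancy: penalty for selecting pairs of similar frames.
    """
    n = len(w)
    rel = sum(w[i] * relevance[i] for i in range(n))
    cov = sum(max(w[j] * sim[i][j] for j in range(n)) for i in range(n)) / n
    red = sum(w[i] * w[j] * sim[i][j] for i in range(n) for j in range(i + 1, n))
    return rel + lam_cov * cov - lam_red * red


# Toy data: 4 frames, similarity decays with temporal distance.
relevance = [0.9, 0.9, 0.1, 0.8]
sim = [[math.exp(-abs(i - j)) for j in range(4)] for i in range(4)]

clustered = [1.0, 1.0, 0.0, 0.0]  # two adjacent frames
spread = [1.0, 0.0, 0.0, 1.0]     # two distant frames

rng = random.Random(0)
student = gumbel_softmax([2.0, 1.5, -1.0, 1.8], tau=0.5, rng=rng)
teacher = softmax([1.0, 0.0, 0.0, 1.0])

print(set_objective(clustered, relevance, sim))
print(set_objective(spread, relevance, sim))
print(kl_divergence(teacher, student))
```

In this toy setup the spread selection scores higher than the clustered one even though its summed relevance is lower, because the redundancy penalty and coverage term act on the set as a whole. In an actual training loop these quantities would be computed on tensors in an autodiff framework so gradients can flow through the relaxed samples.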
