CV · Mar 1

Event-Anchored Frame Selection for Effective Long-Video Understanding

Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, Xiawu Zheng

arXiv:2603.00983v12.81 citations · h-index: 16

Originality: Highly original

AI Analysis

The paper addresses frame redundancy and limited context windows in long-video understanding with large vision-language models (LVLMs). It offers a training-free, plug-and-play frame selector that can be added to off-the-shelf LVLMs.

The paper introduces Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline for efficient frame selection in long-video understanding. Applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.

Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
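The three stages described in the abstract (event partitioning, per-event anchor selection, and MMR-based refinement) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes frame embeddings (for example from DINO) and per-frame query-relevance scores are already computed, it uses a simple cosine-similarity threshold to cut segments, and it uses plain MMR with a fixed trade-off `lam` in place of the paper's adaptive scheme. The function names and the `sim_threshold`, `budget`, and `lam` parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def segment_events(frame_emb, sim_threshold=0.8):
    """Split frames into contiguous [start, end) segments.

    Assumption: a new segment starts wherever the cosine similarity between
    consecutive frame embeddings drops below sim_threshold.
    """
    emb = l2_normalize(frame_emb)
    sims = np.sum(emb[:-1] * emb[1:], axis=1)
    boundaries = np.where(sims < sim_threshold)[0] + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(emb)]))
    return list(zip(starts.tolist(), ends.tolist()))


def select_anchors(segments, relevance):
    """Pick the most query-relevant frame index inside each segment."""
    relevance = np.asarray(relevance, dtype=float)
    return [s + int(np.argmax(relevance[s:e])) for s, e in segments]


def mmr_refine(frame_emb, relevance, anchors, budget, lam=0.7):
    """Fill the frame budget greedily with standard (non-adaptive) MMR.

    Anchors seed the selection; if there are more anchors than the budget
    allows, only the most relevant ones are kept.
    """
    emb = l2_normalize(frame_emb)
    relevance = np.asarray(relevance, dtype=float)
    selected = sorted(anchors, key=lambda i: -relevance[i])[:budget]
    sim = emb @ emb.T
    candidates = set(range(len(emb))) - set(selected)
    while len(selected) < budget and candidates:
        cand = np.array(sorted(candidates))
        # Redundancy: similarity to the closest already-selected frame.
        redundancy = sim[np.ix_(cand, selected)].max(axis=1)
        score = lam * relevance[cand] - (1.0 - lam) * redundancy
        best = int(cand[np.argmax(score)])
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

A call such as `mmr_refine(emb, rel, select_anchors(segment_events(emb), rel), budget=8)` would return sorted frame indices. Seeding the selection with one anchor per segment is what gives event coverage, while the MMR score trades query relevance against visual redundancy for the remaining slots.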
