CV · Mar 1

Event-Anchored Frame Selection for Effective Long-Video Understanding

arXiv:2603.00983v1 · 1 citation · h-index: 3
Originality: Highly original
AI Analysis

EFS addresses frame redundancy and limited context windows in long-video analysis with large vision-language models, offering a training-free, plug-and-play frame selector.

The paper tackles efficient frame selection for long-video understanding by introducing Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. When applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.

Massive frame redundancy and limited context windows make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm that treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual diversity. As a training-free, plug-and-play module, EFS can be seamlessly integrated into off-the-shelf LVLMs, yielding substantial gains on challenging video understanding benchmarks. Specifically, when applied to LLaVA-Video-7B, EFS improves accuracy by 4.7%, 4.9%, and 8.8% on VideoMME, LongVideoBench, and MLVU, respectively.
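
The three stages described in the abstract (event segmentation, per-event anchor selection, and MMR refinement seeded by the anchors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the adjacent-frame similarity threshold used to cut events, and the fixed trade-off weight `lam` are all assumptions made for illustration, and the paper's MMR scheme is adaptive rather than fixed. The sketch takes per-frame embeddings and per-frame query-relevance scores as given inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_events(frame_emb, sim_threshold=0.8):
    """Cut the stream wherever adjacent-frame similarity drops below the
    threshold (an assumed boundary rule). Returns (start, end) index pairs
    with end exclusive."""
    if not frame_emb:
        return []
    starts = [0]
    for i in range(1, len(frame_emb)):
        if cosine(frame_emb[i - 1], frame_emb[i]) < sim_threshold:
            starts.append(i)
    ends = starts[1:] + [len(frame_emb)]
    return list(zip(starts, ends))


def select_anchors(relevance, segments):
    """Pick the most query-relevant frame index inside each segment."""
    return [max(range(s, e), key=lambda i: relevance[i]) for s, e in segments]


def mmr_refine(frame_emb, relevance, anchors, budget, lam=0.7):
    """Greedy MMR seeded with the anchors. `lam` is a fixed weight here
    (assumption); each step adds the frame maximizing
    lam * relevance - (1 - lam) * max similarity to already selected frames."""
    # If there are more anchors than budget, keep the most relevant ones.
    selected = sorted(anchors, key=lambda i: relevance[i], reverse=True)[:budget]
    candidates = [i for i in range(len(frame_emb)) if i not in selected]
    while len(selected) < budget and candidates:
        def mmr_score(i):
            redundancy = max(
                (cosine(frame_emb[i], frame_emb[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


# Toy example: nine frames forming three visually distinct groups.
frame_emb = [
    [1, 0, 0], [1, 0.1, 0], [1, 0, 0.1],
    [0, 1, 0], [0.1, 1, 0], [0, 1, 0.1],
    [0, 0, 1], [0.1, 0, 1], [0, 0.1, 1],
]
relevance = [0.1, 0.9, 0.2, 0.3, 0.2, 0.8, 0.5, 0.4, 0.1]

segments = segment_events(frame_emb)
anchors = select_anchors(relevance, segments)
keyframes = mmr_refine(frame_emb, relevance, anchors, budget=4)
```

In a real setting, `frame_emb` would hold DINO features as the abstract describes, while `relevance` would come from some query-frame scoring model; the abstract does not specify which, so the scores above are placeholders. Seeding MMR with one anchor per segment is what guarantees event coverage before the remaining budget is spent on relevance and diversity.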
