CV · MM · Mar 12, 2025

Generative Frame Sampler for Long Video Understanding

arXiv:2503.09146v2 · 31 citations · h-index: 13 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the computational burden of long video understanding in VideoLLMs, offering an incremental improvement through a new frame sampling method.

The paper tackles the challenge of efficiently understanding long-form videos with thousands of frames by introducing Generative Frame Sampler (GenS), a plug-and-play module that boosts VideoLLM performance. With GenS, LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, results the authors report as state-of-the-art.

Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses a substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient perception of lengthy videos. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, VILA-40B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo, surpassing Gemini-1.5-pro by 1.9 points. We will release all datasets and models at https://generative-sampler.github.io.
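
The abstract describes GenS as a module that picks question-relevant frames before a VideoLLM sees the video. The snippet below is only a minimal sketch of that general idea, not the authors' implementation: it assumes some upstream scorer has already produced one relevance score per frame, and the function name `select_relevant_frames`, the `budget` parameter, and the example scores are all hypothetical. It keeps the highest-scoring frames and returns them in temporal order so a downstream model would still see them chronologically.

```python
from typing import List, Sequence


def select_relevant_frames(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    `scores` is a hypothetical per-frame relevance score (higher = more
    relevant to the question); how it is produced is outside this sketch.
    """
    if budget <= 0:
        return []
    # Rank frame indices by relevance, highest first; ties keep the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore chronological order for the frames that fit in the budget.
    return sorted(ranked[:budget])


# Hypothetical relevance scores for an 8-frame clip.
scores = [0.1, 0.9, 0.2, 0.0, 0.8, 0.3, 0.7, 0.1]
selected = select_relevant_frames(scores, budget=3)
print(selected)  # [1, 4, 6]
```

Only the selected frames would then be passed to the VideoLLM, which is where the reduction in computational burden would come from under this assumption.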

