CV | Nov 24, 2025

EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models

arXiv:2511.18920v1
Originality: Highly original
AI Analysis

The paper addresses the inference cost of video understanding with a training-free solution that is incremental but impactful for real-world applications.

The paper tackles the high inference cost of video large language models by proposing EventSTU, an event-guided framework that removes redundant frames and tokens. It reports a 3.01x FLOPs reduction and a 3.10x prefilling speedup over the strongest baseline while still improving performance.

Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
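
The three ideas in the abstract (change-triggered keyframe selection, saliency-guided token pruning, and relevance-based budget allocation) can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract gives no implementation details, so every function name, the single-pass thresholding (the paper describes a coarse-to-fine scheme), the proportional budget rule, and the assumption that per-frame event counts, per-token saliency, and per-frame question-relevance scores are already available are hypothetical choices made only for illustration.

```python
from typing import List, Sequence


def select_keyframes(event_counts: Sequence[int], threshold: int) -> List[int]:
    """Hypothetical change-triggered sampling: keep the first frame, then keep
    a frame whenever the events accumulated since the last kept frame reach
    `threshold`. Static stretches (few events) are skipped."""
    if not event_counts:
        return []
    keep = [0]
    accumulated = 0
    for i in range(1, len(event_counts)):
        accumulated += event_counts[i]
        if accumulated >= threshold:
            keep.append(i)
            accumulated = 0
    return keep


def allocate_budgets(relevance: Sequence[float], total_tokens: int) -> List[int]:
    """Hypothetical budget rule: split `total_tokens` across keyframes in
    proportion to their question-relevance scores (rounded down)."""
    if not relevance:
        return []
    total = sum(relevance)
    if total <= 0:
        # No usable relevance signal: fall back to an even split.
        return [total_tokens // len(relevance)] * len(relevance)
    return [int(total_tokens * r / total) for r in relevance]


def prune_tokens(saliency: Sequence[float], budget: int) -> List[int]:
    """Hypothetical saliency-guided pruning: return the indices of the
    `budget` most salient tokens, restored to their original order."""
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:budget])
```

For example, with per-frame event counts `[0, 1, 0, 9, 0, 0, 12, 1]` and a threshold of 5, `select_keyframes` keeps frames 0, 3, and 6; a more relevant keyframe then receives a larger share of the token budget, and `prune_tokens` keeps only that many of its most event-salient tokens.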

