CV, Mar 24

ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang

CMU

arXiv:2603.2318677.21 citations, h-index: 7

AI Analysis

This work addresses the efficiency-performance trade-off in video processing, offering a lightweight way to improve temporal reasoning in VideoLLMs. It is incremental, however, as it builds on existing visual prompting and frame selection methods.

The paper tackles the degradation of temporal reasoning in Video Large Language Models (VideoLLMs) when frames are sparsely sampled for efficiency. It introduces ViKey, a training-free framework that enhances temporal understanding via visual prompting and Keyword-Frame Mapping, achieving performance comparable to dense-frame baselines with as few as 20% of frames on some datasets.

Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
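The idea of using frame indices as dictionary-like keys can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how KFM scores relevance, so the per-frame relevance scores below are assumed to come from some external text-image similarity model, and the names `sample_frame_indices`, `keyword_frame_mapping`, and `build_prompt` are hypothetical. The visual overlay itself (drawing each ordinal label onto its frame) is omitted; only the index bookkeeping is shown.

```python
from typing import Dict, List, Sequence


def sample_frame_indices(num_frames: int, keep_ratio: float = 0.2) -> List[int]:
    """Uniformly pick about keep_ratio of the frames (at least one).

    Assumption: uniform sampling stands in for whatever frame
    selection method is actually used.
    """
    keep = max(1, round(num_frames * keep_ratio))
    step = num_frames / keep
    return [int(i * step) for i in range(keep)]


def keyword_frame_mapping(scores: Dict[str, Sequence[float]]) -> Dict[str, int]:
    """Map each keyword to the 1-based ordinal label of its best-scoring frame.

    scores[keyword][i] is a hypothetical relevance score between the
    keyword and the i-th sampled frame. Ties go to the earliest frame.
    """
    mapping: Dict[str, int] = {}
    for keyword, per_frame in scores.items():
        best = max(range(len(per_frame)), key=lambda i: per_frame[i])
        mapping[keyword] = best + 1  # labels drawn on frames start at 1
    return mapping


def build_prompt(question: str, mapping: Dict[str, int], num_sampled: int) -> str:
    """Compose a text prompt that refers to frames by their ordinal labels."""
    anchors = ", ".join(f"'{k}' -> frame {v}" for k, v in mapping.items())
    return (
        f"Frames are labeled 1 to {num_sampled} in temporal order. "
        f"Keyword anchors: {anchors}. Question: {question}"
    )


if __name__ == "__main__":
    indices = sample_frame_indices(num_frames=50, keep_ratio=0.2)
    # Made-up scores for three sampled frames, for illustration only.
    mapping = keyword_frame_mapping(
        {"pour": [0.1, 0.7, 0.2], "stir": [0.0, 0.3, 0.9]}
    )
    print(indices)
    print(build_prompt("Does pouring happen before stirring?", mapping, 3))
```

Because the labels on the frames and the keys in the mapping share the same numbering, the text prompt can point to a specific frame unambiguously even when intermediate frames have been dropped.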
