CV · Dec 19, 2025

Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos

arXiv:2512.17229v2 · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses prohibitive memory consumption and information overload in long video analysis, a problem relevant to AI researchers and developers. The advance is incremental, since it builds on existing MLLM methods.

The paper tackles the challenge of Long Video Question-Answering (LVQA) for Multi-modal Large Language Models (MLLMs) by proposing VideoDetective, an efficient question-aware memory mechanism that recurrently seeks critical clues, enabling processing of 100K tokens (3600 frames) in 2 minutes with 37GB GPU memory.

Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) because of the immense context and information overload, which can also lead to prohibitive memory consumption. Existing methods address these issues by reducing visual tokens or extending the model's context length, but they may miss useful information or require considerable computation. In fact, answering a given question requires only a small amount of crucial information. We therefore propose an efficient question-aware memory mechanism that enables MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies the task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy introduces a few special memory tokens to achieve purposeful compression. This allows models to seek critical clues effectively while reducing visual tokens. Then, because history context can have a significant impact, we recurrently aggregate and store these memory tokens to update the history context, which is reused for subsequent sub-segments. Furthermore, to measure a model's long video understanding ability more effectively, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset that features grounding of critical and concrete clues scattered throughout entire videos. Experimental results demonstrate that our method enables MLLMs with a limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB of GPU memory. Evaluation results across multiple long video benchmarks show that our method seeks critical clues from massive information more effectively.
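To make the context-length claim concrete, the following is a rough token-budget sketch, not the authors' implementation. It only counts tokens: each sub-segment is processed together with the memory tokens accumulated from earlier sub-segments, and only a few memory tokens per sub-segment are carried forward. The figure of 28 tokens per frame is an assumption derived from the abstract's numbers (about 100K tokens over 3600 frames). The sub-segment length, the number of memory tokens, and the question length are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    total_frames: int = 3600       # from the abstract: one hour sampled at 1 fps
    tokens_per_frame: int = 28     # assumption: roughly 100K tokens / 3600 frames
    frames_per_segment: int = 300  # hypothetical sub-segment length
    memory_tokens: int = 16        # hypothetical memory tokens kept per sub-segment
    question_tokens: int = 64      # hypothetical question length


def peak_context(b: Budget) -> int:
    """Largest number of tokens held in context at any recurrent step."""
    history = 0  # memory tokens accumulated from earlier sub-segments
    peak = 0
    for start in range(0, b.total_frames, b.frames_per_segment):
        frames = min(b.frames_per_segment, b.total_frames - start)
        # Context at this step: history + current visual tokens
        # + this sub-segment's memory tokens + the question.
        step = (history + frames * b.tokens_per_frame
                + b.memory_tokens + b.question_tokens)
        peak = max(peak, step)
        history += b.memory_tokens  # only the compressed tokens are carried forward
    return peak


budget = Budget()
total_visual = budget.total_frames * budget.tokens_per_frame
peak = peak_context(budget)
print(f"visual tokens overall: {total_visual}")
print(f"peak context per step: {peak}")
```

Under these assumed values the video contributes about 100K visual tokens in total, while the context at any single step stays far below a 32K limit, which is the property the abstract relies on. Other segment lengths or memory sizes would shift the peak but not the overall pattern.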

