CV · Mar 12

WAT: Online Video Understanding Needs Watching Before Thinking

arXiv:2603.13412 · 90.2 · h-index: 6 · Has Code
Predicted impact: top 15% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This paper addresses real-time video understanding for applications that require streaming analysis. The advance is incremental, since it builds on existing Video LLM efforts.

The paper tackles the challenge of enabling Multimodal Large Language Models to perform online video reasoning under memory constraints by proposing WAT, a two-stage framework that separates watching and thinking stages with a hierarchical memory system. It achieves state-of-the-art results, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, while operating at real-time frame rates.

Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
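The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HierarchicalMemory`, the methods `watch` and `think`, the use of cosine similarity, the rule of evicting the entry most similar to another stored entry, and the averaging of query and STM features are all assumptions made for illustration. Frames are represented by plain feature vectors rather than real visual embeddings.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HierarchicalMemory:
    """Hypothetical STM + fixed-capacity LTM, loosely following the abstract."""

    def __init__(self, stm_size, ltm_capacity):
        self.stm_size = stm_size
        self.ltm_capacity = ltm_capacity
        self.stm = deque()  # recent (frame_id, feature) pairs
        self.ltm = []       # summary of older (frame_id, feature) pairs

    def watch(self, frame_id, feature):
        """Query-independent stage: buffer the frame, spill old frames to LTM."""
        self.stm.append((frame_id, feature))
        if len(self.stm) > self.stm_size:
            self.ltm.append(self.stm.popleft())
            if len(self.ltm) > self.ltm_capacity:
                self._evict_most_redundant()

    def _evict_most_redundant(self):
        """Assumed eviction rule: drop the entry closest to another LTM entry."""
        def redundancy(i):
            return max(
                (cosine(self.ltm[i][1], self.ltm[j][1])
                 for j in range(len(self.ltm)) if j != i),
                default=0.0,
            )
        # On ties, max() returns the first index, i.e. the older entry.
        victim = max(range(len(self.ltm)), key=redundancy)
        del self.ltm[victim]

    def think(self, query_feature, k):
        """Query-triggered stage: return ids of the k most relevant LTM frames."""
        context = list(query_feature)
        if self.stm:
            n = len(self.stm)
            stm_mean = [sum(f[d] for _, f in self.stm) / n
                        for d in range(len(context))]
            # Assumed fusion: average the query with the mean STM feature.
            context = [(q + s) / 2 for q, s in zip(context, stm_mean)]
        ranked = sorted(self.ltm, key=lambda e: cosine(context, e[1]),
                        reverse=True)
        return sorted(fid for fid, _ in ranked[:k])  # temporal order
```

In this sketch the LTM never grows beyond `ltm_capacity`, so memory use stays bounded however long the stream runs, and near-duplicate frames are the first to be dropped. How the paper actually scores redundancy and fuses the query with STM context is not stated in the abstract.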

