CV · May 21, 2025

LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval

arXiv:2505.15269v1 · 34 citations · h-index: 8
Originality: Highly original
AI Analysis

LiveVLM addresses the need for real-time video interaction in applications such as autonomous driving and robotics. It is a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles inefficient memory usage and slow response speed in online video understanding by proposing LiveVLM, a training-free framework. It lets the base LLaVA-OneVision model process 44× as many frames on the same device and responds up to 5× faster than state-of-the-art online methods at an input of 256 frames, while maintaining the same or better performance.

Recent developments in Video Large Language Models (Video LLMs) have enabled models to process long video sequences and demonstrate remarkable performance. Nonetheless, studies predominantly focus on offline video question answering, neglecting memory usage and response speed, which are essential in various real-world applications such as Deepseek services, autonomous driving, and robotics. To mitigate these challenges, we propose LiveVLM, a training-free framework specifically designed for streaming, online video understanding and real-time interaction. Unlike existing works that process videos only after a question is posed, LiveVLM constructs an innovative streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs, ensuring prompt responses to user queries. For continuous video streams, LiveVLM generates and compresses video key-value tensors (video KVs) to preserve visual information while improving memory efficiency. Furthermore, when a new question is posed, LiveVLM runs an online question-answering process that efficiently fetches both short-term and long-term visual information while minimizing interference from redundant context. Extensive experiments demonstrate that LiveVLM enables the foundation LLaVA-OneVision model to process 44× as many frames on the same device and achieves up to a 5× speedup in response speed compared with SoTA online methods at an input of 256 frames, while maintaining the same or better model performance.
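The abstract describes the mechanism only at a high level, so the following is a toy sketch of the general idea rather than LiveVLM's actual algorithm. It assumes a hypothetical `StreamingKVCache` that keeps the most recent frames uncompressed (short-term), compresses older frames by keeping only their highest-norm key tokens (long-term), and at question time retrieves the long-term frames whose mean key best matches a query vector. The class name, the norm-based importance score, and the dot-product retrieval rule are all illustrative assumptions, not details taken from the paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FrameKV:
    frame_id: int
    keys: list    # one key vector (list of floats) per visual token
    values: list  # one value per token, aligned with keys


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


class StreamingKVCache:
    """Hypothetical toy cache: full recent frames plus compressed older frames."""

    def __init__(self, short_window=2, keep_per_frame=1):
        self.short_window = short_window      # frames kept uncompressed
        self.keep_per_frame = keep_per_frame  # tokens kept per compressed frame
        self.short_term = deque()
        self.long_term = []

    def _compress(self, frame):
        # Assumption: squared key norm stands in for token importance.
        order = sorted(range(len(frame.keys)),
                       key=lambda i: dot(frame.keys[i], frame.keys[i]),
                       reverse=True)
        kept = sorted(order[:self.keep_per_frame])  # restore token order
        return FrameKV(frame.frame_id,
                       [frame.keys[i] for i in kept],
                       [frame.values[i] for i in kept])

    def append(self, frame):
        # Called once per incoming frame, before any question arrives.
        self.short_term.append(frame)
        while len(self.short_term) > self.short_window:
            self.long_term.append(self._compress(self.short_term.popleft()))

    def retrieve(self, query, top_frames=1):
        # Pick the best-matching long-term frames, then add all recent frames.
        ranked = sorted(self.long_term,
                        key=lambda f: dot(query, mean_vector(f.keys)),
                        reverse=True)
        selected = sorted(ranked[:top_frames], key=lambda f: f.frame_id)
        return selected + list(self.short_term)

    def num_tokens(self):
        frames = self.long_term + list(self.short_term)
        return sum(len(f.keys) for f in frames)


# Usage: stream four 2-token frames, then ask a "question" encoded as a vector.
cache = StreamingKVCache(short_window=2, keep_per_frame=1)
stream = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 1.0], [0.0, 2.0]],
    [[1.0, 1.0], [2.0, 2.0]],
    [[0.5, 0.0], [0.0, 0.5]],
]
for i, keys in enumerate(stream):
    cache.append(FrameKV(i, keys, [f"v{i}_{j}" for j in range(len(keys))]))

context = cache.retrieve([0.0, 1.0], top_frames=1)
print([f.frame_id for f in context])
print(cache.num_tokens())
```

In this sketch the cost of compression is paid while the video streams in, so answering a question only involves ranking the already-compressed frames. That mirrors the abstract's contrast with methods that begin processing the video only after a question is posed.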
