CV · AI · Mar 19

Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu, Qianxi Zhang, Weijun Wang, Ting Cao, Yunxin Liu

arXiv:2603.1905490.71 citations · h-index: 3

Predicted impact: top 15% in CV · last 90 days · Originality: Incremental advance

AI Analysis

The paper addresses computational constraints in proactive video understanding for applications that require real-time interaction, though the contribution appears incremental because it builds on existing proactive VideoLLMs.

The paper tackles the efficiency-accuracy dilemma in proactive streaming video understanding by proposing Em-Garde, a framework that decouples semantic understanding from streaming perception, resulting in consistent improvements in proactive response accuracy and efficiency on StreamingBench and OVO-Bench.

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
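
The streaming half of the propose-match idea can be illustrated with a minimal sketch. The abstract does not specify the embedding model, the similarity measure, or the trigger rule, so everything here is an assumption made for illustration: the function names `cosine` and `match_stream` are hypothetical, cosine similarity stands in for whatever matching score the paper uses, and the `threshold` value is arbitrary. Proposal embeddings are computed once at query time; each incoming frame embedding is then compared against them, and a response is triggered when the best score crosses the threshold.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_stream(
    proposal_embeddings: Sequence[Sequence[float]],
    frame_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return (frame_index, proposal_index) for every frame whose
    best-matching proposal scores at or above the threshold."""
    if not proposal_embeddings:
        return []
    triggers = []
    for t, frame in enumerate(frame_embeddings):
        scores = [cosine(frame, p) for p in proposal_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            triggers.append((t, best))
    return triggers


# Toy data: two proposals (fixed at query time) and three streamed frames.
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.1, 0.1, 1.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.1]]
triggers = match_stream(proposals, frames)
```

In this toy run the first frame matches neither proposal, while the second and third frames trigger on proposals 0 and 1 respectively. The per-frame cost is a handful of dot products rather than a full model forward pass, which is the kind of saving the abstract attributes to decoupling semantic understanding from streaming perception.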
