CV · AI · Mar 19

Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

arXiv:2603.19054 · 90.71 citations · h-index: 3
Predicted impact: top 15% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses computational constraints in proactive video understanding for applications that require real-time interaction, though it appears incremental because it builds on existing proactive VideoLLMs.

The paper tackles the efficiency-accuracy dilemma in proactive streaming video understanding by proposing Em-Garde, a framework that decouples semantic understanding from streaming perception, resulting in consistent improvements in proactive response accuracy and efficiency on StreamingBench and OVO-Bench.

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decisions, which suffer from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
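
The propose-match split described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not say how proposals or frames are embedded, how similarity is scored, or how the trigger is thresholded. The sketch assumes the proposals are already embedded once at query time, that each incoming frame arrives as an embedding of the same dimension, and that a response is triggered when the cosine similarity to the best-matching proposal reaches a fixed threshold. The names `cosine` and `match_stream`, the hand-written vectors, and the value `0.9` are all hypothetical.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_stream(
    proposals: List[Vector],
    frames: Iterable[Vector],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> Iterator[Tuple[int, int, float]]:
    """Yield (frame_index, proposal_index, score) for each frame whose
    best-matching proposal reaches the threshold.

    The proposals are fixed before streaming starts, so the per-frame
    work is only a handful of similarity computations.
    """
    if not proposals:
        return
    for t, frame in enumerate(frames):
        scores = [cosine(frame, p) for p in proposals]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            yield t, best, scores[best]


# Toy data: two proposal embeddings and four frame embeddings (all invented).
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [
    [0.1, 0.1, 1.0],   # unrelated to either proposal
    [0.9, 0.1, 0.0],   # close to proposal 0
    [0.0, 1.0, 0.05],  # close to proposal 1
    [0.5, 0.5, 0.5],   # ambiguous, below the threshold
]
triggers = [(t, k) for t, k, _ in match_stream(proposals, frames)]
print(triggers)
```

In this toy run only frames 1 and 2 trigger, matching proposals 0 and 1 respectively. The point of the sketch is the division of labour: the expensive query interpretation happens once, while the streaming loop does only cheap vector comparisons.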

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
