CV · Dec 4, 2024

Streaming Detection of Queried Event Start

Salesforce · Stanford
arXiv:2412.03567v1 · 2 citations · h-index: 64 · NIPS
Originality: Synthesis-oriented
AI Analysis

The paper addresses the need for robotics and autonomous systems to react quickly to user-defined events. Its contribution is incremental, since it builds on existing vision-language models extended with adapters.

The paper tackles the problem of detecting the start of complex events in real-time from natural language queries in egocentric videos, introducing a new benchmark and achieving high accuracy with low latency through adapter-based baselines.

Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, as well as new task-specific metrics to study streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods in NLP and for video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate three vision-language backbones and three adapter architectures on both short-clip and untrimmed video settings.
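To make the task setting concrete, the following is a minimal sketch of streaming start detection, not the paper's model or its metrics. It assumes a hypothetical per-frame score in [0, 1] expressing how strongly the current frame matches the query; the function names, the threshold rule, and the frame-count latency measure are all illustrative assumptions.

```python
from typing import Iterable, Optional


def detect_event_start(scores: Iterable[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first frame whose score reaches the threshold.

    Frames are consumed one at a time, so the decision at frame t depends
    only on frames 0..t (no look-ahead). Returns None if nothing fires.
    """
    for t, score in enumerate(scores):
        if score >= threshold:
            return t
    return None


def latency_in_frames(detected: Optional[int], true_start: int) -> Optional[int]:
    """Signed delay in frames: positive means late, negative means early.

    Returns None when the event was never detected.
    """
    if detected is None:
        return None
    return detected - true_start


# Hypothetical per-frame scores from some query-conditioned model (assumption).
frame_scores = [0.05, 0.10, 0.20, 0.45, 0.70, 0.90]
detected = detect_event_start(frame_scores, threshold=0.5)
delay = latency_in_frames(detected, true_start=3)
print(detected, delay)
```

In this toy setup, lowering the threshold tends to reduce delay at the cost of more early or spurious detections, which illustrates why the task asks for both high accuracy and low latency.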

Code Implementations: 1 repo
