CV · Nov 25, 2024

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

arXiv:2411.16934v2 · 7 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses a limitation of offline episodic memory systems, which are impractical on power-constrained wearable devices. The contribution is incremental because it builds on existing object discovery and tracking methods.

The paper tackles the problem of enabling wearable cameras to retrieve object localizations from video streams in real time. It introduces the OVQ2D task and the ESOM framework, which reaches only ~4% success but improves substantially when object tracking and discovery are assumed to be perfect.

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.

