CV · Jun 28, 2023

SpotEM: Efficient Video Search for Episodic Memory

arXiv:2306.15850v1 · 16 citations · h-index: 99
Originality: Incremental advance
AI Analysis

This paper addresses the computational infeasibility of episodic memory search over long wearable-camera videos. The advance is incremental because it builds on existing EM methods rather than replacing them.

The paper tackles the inefficiency of searching long egocentric videos for episodic memory by proposing SpotEM, which reduces the number of clip features computed to 10%-25% while preserving 84%-97% of the original model's accuracy.

The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% - 25% of the clip features, we preserve 84% - 97% of the original EM model's accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem
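The selection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the learned, query-conditioned clip selector is replaced here by a plain cosine similarity between cheap per-clip features and a query embedding, and the names `select_clips`, `sparse_features`, `budget`, and `expensive_fn` are hypothetical. It only shows the general pattern of scoring every clip cheaply and running the expensive feature extractor on a small budgeted subset.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_clips(cheap_feats, query_feat, budget):
    """Pick the top-scoring fraction `budget` of clips for a query.

    Assumption: cosine similarity stands in for a learned selector.
    Returns the chosen clip indices in temporal order.
    """
    k = max(1, math.ceil(budget * len(cheap_feats)))
    scores = [cosine(f, query_feat) for f in cheap_feats]
    # sorted() is stable, so ties keep their original (temporal) order.
    ranked = sorted(range(len(cheap_feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_features(clips, selected, expensive_fn):
    """Run the expensive extractor only on selected clips; others stay None."""
    chosen = set(selected)
    return [expensive_fn(c) if i in chosen else None for i, c in enumerate(clips)]


# Toy example: 8 clips with 2-d cheap features, and a query pointing along axis 1.
cheap = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
         [0.5, 0.5], [0.0, 1.0], [1.0, 0.0], [0.2, 0.8]]
query = [0.0, 1.0]
picked = select_clips(cheap, query, budget=0.25)  # -> [2, 5]
```

With a budget of 0.25, only two of the eight clips would be passed to the expensive extractor. In the paper's setting the selector is trained jointly with the EM model using distillation losses, which this sketch does not attempt to reproduce.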

