CV | SD | AS | Feb 10, 2022

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

arXiv:2202.04947v3 | 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving action localization in first-person videos for applications such as activity recognition. Its contribution is incremental: it adds audio to existing visual-only methods.

The paper tackles the problem of localizing actions in egocentric videos by incorporating audiovisual context, improving localization performance over visual-only models by +2.23% and +3.35% mAP on its two datasets, EPIC-Kitchens and HOMAGE.

Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multimodal signals. However, most current localization methods use third-person videos and only incorporate visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple-yet-effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric temporal action localization (TAL). We validate our approach on two large-scale datasets, EPIC-Kitchens and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context. Namely, we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% on these datasets.
