CV SD AS | Feb 10, 2022

OWL (Observe, Watch, Listen): Audiovisual Temporal Context for Localizing Actions in Egocentric Videos

Merey Ramazanova, Victor Escorcia, Fabian Caba Heilbron, Chen Zhao, Bernard Ghanem

arXiv:2202.04947v32.64 citations

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of localizing actions in first-person videos, which matters for applications like activity recognition. The advance is incremental: it adds audio to existing visual methods.

The paper tackles temporal action localization in egocentric videos by incorporating audiovisual context, improving localization performance (mAP) over visual-only models by +2.23% and +3.35% on two datasets, EPIC-Kitchens and HOMAGE.

Egocentric videos capture sequences of human activities from a first-person perspective and can provide rich multimodal signals. However, most current localization methods use third-person videos and only incorporate visual information. In this work, we take a deep look into the effectiveness of audiovisual context in detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL). OWL leverages audiovisual information and context for egocentric temporal action localization (TAL). We validate our approach on two large-scale datasets, EPIC-Kitchens and HOMAGE. Extensive experiments demonstrate the relevance of the audiovisual temporal context. Namely, we boost the localization performance (mAP) over visual-only models by +2.23% and +3.35% on the above datasets.
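For readers unfamiliar with how mAP is obtained in temporal action localization: a predicted segment is typically counted as correct when its temporal intersection-over-union (tIoU) with a ground-truth segment exceeds a threshold. The sketch below shows only that tIoU computation. It is a generic illustration, not code from the paper; the function name and the (start, end) tuple layout in seconds are assumptions made for this example.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) in seconds.

    Hypothetical helper for illustration; not taken from the OWL paper.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union is the sum of both lengths minus the overlap.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # A prediction spanning 2-6 s against a ground-truth action at 4-8 s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

In the usual evaluation setup, average precision is computed per class at one or more tIoU thresholds and then averaged into mAP. The exact thresholds behind the numbers above are not stated in this summary.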

