CV | LG | AS | Oct 15, 2019

Seeing and Hearing Egocentric Actions: How Much Can We Learn?

arXiv:1910.06693v1 | 20 citations
Originality: Incremental advance
AI Analysis

This paper addresses action recognition in egocentric video, but the advance is incremental because it builds on existing multimodal methods.

The paper tackles egocentric action recognition in a kitchen environment by integrating the visual and audio modalities, achieving a 5.18% improvement over the state of the art on verb classification.

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
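The abstract names two ingredients, sparse temporal sampling and late fusion of audio, spatial, and temporal streams, without giving details here. The following is a minimal sketch of how such a pipeline could be wired together, not the authors' implementation. It assumes segment-midpoint sampling, averaging of logits across segments, and a weighted average of per-stream softmax scores; the function names, the equal default weights, and the toy logits are all hypothetical.

```python
import math


def sparse_segment_indices(num_frames, num_segments):
    """Split a clip into equal segments and pick the midpoint frame of each.

    Assumption: midpoint sampling; other schemes pick a random frame per segment.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    seg_len = num_frames / num_segments
    return [min(num_frames - 1, int(seg_len * (k + 0.5)))
            for k in range(num_segments)]


def segment_consensus(per_segment_logits):
    """Average class logits over the sampled segments of one stream."""
    n = len(per_segment_logits)
    num_classes = len(per_segment_logits[0])
    return [sum(seg[c] for seg in per_segment_logits) / n
            for c in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Fuse streams by a weighted average of their softmax scores.

    stream_logits: dict mapping a stream name to its class logits.
    weights: optional dict of per-stream weights (hypothetical; equal by default).
    """
    names = sorted(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_w = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / total_w
    return fused


# Toy example: three streams scoring three hypothetical verb classes.
frame_ids = sparse_segment_indices(num_frames=90, num_segments=3)
scores = late_fusion({
    "audio": [2.0, 0.0, 0.0],
    "spatial": [0.0, 1.0, 0.0],
    "temporal": [1.5, 0.0, 0.0],
})
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

Because fusion happens on class scores rather than on intermediate features, each stream can be trained and evaluated on its own, which makes the unimodal versus multimodal comparison reported in the abstract straightforward to set up.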

Code Implementations: 1 repo
