CV LG AS | Oct 15, 2019

Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Alejandro Cartas, Jordi Luque, Petia Radeva, Carlos Segura, Mariella Dimiccoli

arXiv:1910.06693v1 | 9.4 | 20 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses action recognition for egocentric video analysis, but its contribution is incremental because it builds on existing multimodal methods.

The paper tackles egocentric action recognition in a kitchen environment by integrating visual and audio modalities, achieving a 5.18% improvement over the state of the art on verb classification.

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
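The two ingredients named in the abstract, sparse temporal sampling and late fusion of the audio, spatial, and temporal streams, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one randomly chosen frame per equal-length segment and a weighted average of per-stream softmax probabilities. The function names (`sample_segment_indices`, `late_fusion`) and the example scores are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Sparse temporal sampling (assumed variant): split the clip into
    num_segments equal-length chunks and draw one frame index from each."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)
    # Integer chunk boundaries; every chunk holds at least one frame.
    bounds = [(i * num_frames) // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Late fusion (assumed variant): convert each stream's class scores to
    probabilities, then take a weighted average across streams."""
    names = sorted(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights are an assumption
    total_weight = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for cls, prob in enumerate(probs):
            fused[cls] += weights[name] * prob / total_weight
    return fused


# Hypothetical per-stream scores for two verb classes.
scores = {"audio": [2.0, 0.0], "spatial": [0.0, 1.0], "temporal": [3.0, 0.0]}
fused = late_fusion(scores)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Fusing at the score level keeps each stream's network independent, so a stream can be retrained or dropped without touching the others. The weights would normally be tuned on validation data rather than fixed at equal values.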
