CV | SD | AS | Nov 1, 2021

With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition

arXiv:2111.01024v1 | 57 citations | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses action recognition in first-person videos. The advance is incremental: it builds on existing multimodal methods by adding temporal and language context.

The paper tackles action recognition in egocentric videos by leveraging temporal context, and reports state-of-the-art performance on the EPIC-KITCHENS and EGTEA datasets.

In egocentric videos, actions occur in quick succession. We capitalise on the action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action sequence context to enhance the predictions. We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance. Our ablations showcase the advantage of utilising temporal context as well as incorporating audio input modality and language model to rescore predictions. Code and models at: https://github.com/ekazakos/MTCN.
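To make the rescoring idea concrete, here is a minimal sketch of how a language model over action sequences could re-rank per-action predictions within a temporal window. It is not the authors' implementation: the abstract does not say how rescoring is done, so the exhaustive search over top-k candidates, the toy bigram model, the `lm_weight` parameter, and all the names and probabilities below are assumptions made for illustration.

```python
import itertools
import math


def rescore_sequence(step_log_probs, bigram_log_prob, top_k=2, lm_weight=1.0):
    """Pick the best-scoring action sequence for a window of neighbouring actions.

    step_log_probs: list of dicts {action: log-probability}, one per action in
        the window, standing in for the audio-visual model's outputs.
    bigram_log_prob: callable (previous_action, action) -> log-probability,
        a hypothetical stand-in for a language model over action sequences.
    """
    # Keep only the top_k most likely actions at each position.
    candidates = [
        sorted(scores, key=scores.get, reverse=True)[:top_k]
        for scores in step_log_probs
    ]
    best_seq, best_score = None, -math.inf
    # Exhaustively score every combination of the retained candidates.
    for seq in itertools.product(*candidates):
        av_score = sum(step_log_probs[i][a] for i, a in enumerate(seq))
        lm_score = sum(bigram_log_prob(p, a) for p, a in zip(seq, seq[1:]))
        total = av_score + lm_weight * lm_score
        if total > best_score:
            best_seq, best_score = seq, total
    return list(best_seq), best_score


# Toy bigram probabilities (invented for this example).
BIGRAMS = {
    ("open fridge", "take milk"): 0.6,
    ("take milk", "close fridge"): 0.6,
    ("open drawer", "take knife"): 0.6,
    ("take knife", "close drawer"): 0.6,
}


def toy_bigram_log_prob(prev_action, action):
    # Unseen pairs get a small default probability.
    return math.log(BIGRAMS.get((prev_action, action), 0.05))


# Invented per-action predictions for a window of three actions.
window = [
    {"open fridge": math.log(0.9), "open drawer": math.log(0.1)},
    {"take knife": math.log(0.55), "take milk": math.log(0.45)},
    {"close fridge": math.log(0.8), "close drawer": math.log(0.2)},
]

without_lm, _ = rescore_sequence(window, toy_bigram_log_prob, lm_weight=0.0)
with_lm, _ = rescore_sequence(window, toy_bigram_log_prob, lm_weight=1.0)
print(without_lm)
print(with_lm)
```

In this toy window the middle action is ambiguous on its own, and the sequence prior from the neighbouring fridge actions tips the decision towards "take milk". Exhaustive enumeration is used only because the window is tiny; a longer window or a larger `top_k` would call for beam search or a similar approximation.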

Code Implementations: 1 repo
