CV · AI · LG · Sep 2, 2021

SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos

arXiv:2109.00829v1 · 29 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of anticipating human actions in egocentric videos, which is crucial for applications like assistive robotics. The contribution is incremental: it extends the existing RULSTM architecture rather than introducing a new one.

The paper tackles action anticipation in egocentric videos by proposing an attention-based technique that jointly processes slow and fast features from three modalities (RGB, optical flow, and extracted objects), building on the RULSTM architecture. It reports systematic improvements over RULSTM in Top-5 accuracy on the EpicKitchens-55 and EGTEA Gaze+ datasets at several anticipation times.
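For readers unfamiliar with the metric, Top-5 accuracy counts a prediction as correct when the true action is among the five highest-scoring classes. The following is a minimal sketch in plain Python; the function name `top_k_accuracy` and the list-of-lists score layout are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample lists of class scores (hypothetical layout).
    labels: list of true class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


if __name__ == "__main__":
    # Two samples over seven classes; only the second label falls in the top 5.
    row = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.0]
    print(top_k_accuracy([row, row], [0, 1]))
```

Top-k metrics are a common choice for anticipation tasks because several future actions can be plausible from the same observed segment.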

Action anticipation in egocentric videos is a difficult task due to the inherently multi-modal nature of human actions. Additionally, some actions happen faster or slower than others depending on the actor or the surrounding context, which can vary each time and lead to different predictions. Based on this idea, we build upon the RULSTM architecture, which is specifically designed for anticipating human actions, and propose a novel attention-based technique to evaluate, simultaneously, slow and fast features extracted from three different modalities, namely RGB, optical flow, and extracted objects. Two branches process information at different time scales, i.e., frame rates, and several fusion schemes are considered to improve prediction accuracy. We perform extensive experiments on the EpicKitchens-55 and EGTEA Gaze+ datasets, and demonstrate that our technique systematically improves the results of the RULSTM architecture for the Top-5 accuracy metric at different anticipation times.
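The two-branch idea in the abstract can be illustrated with a small sketch: one branch sees the features at the original frame rate, the other sees a subsampled (slower) version, and an attention step weights the per-branch class scores before summing them. This is a simplified stand-in under stated assumptions, not the paper's fusion scheme: the helper names (`subsample`, `attention_fusion`), the stride value, and the idea of passing attention logits in directly are all hypothetical. In the actual architecture the branches are LSTM-based and the attention weights would be learned.

```python
import math


def subsample(frames, stride):
    """Keep every `stride`-th item, emulating a lower frame rate."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(branch_scores, attention_logits):
    """Weighted sum of per-branch class scores.

    branch_scores: one list of class scores per branch (e.g. slow, fast).
    attention_logits: one scalar per branch; assumed here to be given,
        whereas a trained model would predict them from branch states.
    Returns (fused_scores, attention_weights).
    """
    if len(branch_scores) != len(attention_logits):
        raise ValueError("need exactly one attention logit per branch")
    weights = softmax(attention_logits)
    fused = [0.0] * len(branch_scores[0])
    for w, scores in zip(weights, branch_scores):
        for c, s in enumerate(scores):
            fused[c] += w * s
    return fused, weights


if __name__ == "__main__":
    frames = list(range(12))          # stand-in for per-frame features
    fast_input = subsample(frames, 1)  # full frame rate
    slow_input = subsample(frames, 3)  # every third frame (hypothetical stride)
    print(len(fast_input), len(slow_input))

    # Hypothetical class scores from each branch, fused with equal attention.
    fused, weights = attention_fusion([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(fused, weights)
```

Using softmax for the weights keeps them positive and summing to one, so the fused scores stay on the same scale as the per-branch scores.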

