Learn to cycle: Time-consistent feature discovery for action recognition
This work addresses the problem of focusing on short-term discriminative motions in video action recognition, and improves incrementally on existing methods.
The paper tackles the challenge of generalizing over temporal variations in video action recognition by introducing Squeeze and Recursion Temporal Gates (SRTG), which allow some flexibility in discovering relevant spatio-temporal features. SRTG blocks yield consistent improvements: performance is on par with state-of-the-art models on Kinetics-700 and better on HACS, Moments in Time, UCF-101, and HMDB-51.
Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs with similar activations while allowing for potential temporal variations. We implement this idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics, in conjunction with a temporal gate that is responsible for evaluating the consistency of the discovered dynamics and the modeled features. We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs. On Kinetics-700, we perform on par with current state-of-the-art models, and outperform these on HACS, Moments in Time, UCF-101 and HMDB-51.
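The gating idea described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: an exponential moving average stands in for the LSTM, consistency is tested with a nearest-neighbour round trip between the pooled features and the recurrent outputs, and fusion is a simple addition. All function names (`squeeze`, `recur`, `is_cycle_consistent`, `srtg_gate`) and the `alpha` parameter are hypothetical.

```python
import math


def squeeze(clip):
    """Average each channel over space: [T][C][H*W] -> [T][C]."""
    return [[sum(ch) / len(ch) for ch in frame] for frame in clip]


def recur(seq, alpha=0.5):
    """Stand-in for the LSTM (assumption): exponential moving average over time."""
    out, state = [], seq[0]
    for x in seq:
        state = [alpha * s + (1 - alpha) * v for s, v in zip(state, x)]
        out.append(state)
    return out


def nearest(vec, seq):
    """Index of the element of seq closest to vec (Euclidean distance)."""
    return min(range(len(seq)), key=lambda j: math.dist(vec, seq[j]))


def is_cycle_consistent(a, b):
    """True if each a[i] maps to its nearest b[j], and b[j] maps back to a[i]."""
    return all(nearest(b[nearest(a[i], b)], a) == i for i in range(len(a)))


def srtg_gate(clip, alpha=0.5):
    """Fuse recurrent dynamics into the features only if the gate opens."""
    pooled = squeeze(clip)            # per-frame, per-channel descriptors
    dynamics = recur(pooled, alpha)   # modeled feature dynamics
    if not is_cycle_consistent(pooled, dynamics):
        return clip                   # gate closed: features pass unchanged
    # Gate open: add each channel's dynamics value to that channel (assumed fusion).
    return [
        [[v + d for v in ch] for ch, d in zip(frame, dyn)]
        for frame, dyn in zip(clip, dynamics)
    ]
```

In this sketch a clip is a nested list indexed as frame, channel, flattened spatial position. When the smoothed sequence drifts too far from the pooled features for the round trip to return to its starting frame, `srtg_gate` returns its input untouched, so the block behaves like an identity connection in that case.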