CERN: Confidence-Energy Recurrent Network for Group Activity Recognition
This work addresses the problem of recognizing complex human activities in videos, for applications such as surveillance or sports analysis, and represents an incremental improvement over existing LSTM-based methods.
The paper tackles group activity recognition in videos by introducing a Confidence-Energy Recurrent Network (CERN) that replaces softmax with an energy layer and incorporates p-values for confidence estimation, achieving superior performance on the Collective Activity and Volleyball datasets.
This work is about recognizing human activities occurring in videos at distinct semantic levels, including individual actions, interactions, and group activities. The recognition is realized using a two-level hierarchy of Long Short-Term Memory (LSTM) networks, forming a feed-forward deep architecture, which can be trained end-to-end. In comparison with existing architectures of LSTMs, we make two key contributions, which give our approach its name: Confidence-Energy Recurrent Network (CERN). First, instead of using the common softmax layer for prediction, we specify a novel energy layer (EL) for estimating the energy of our predictions. Second, rather than finding the common minimum-energy class assignment, which may be numerically unstable under uncertainty, we specify that the EL additionally computes the p-values of the solutions, and in this way estimates the most confident energy minimum. The evaluation on the Collective Activity and Volleyball datasets demonstrates: (i) advantages of our two contributions relative to the common softmax and energy-minimization formulations, and (ii) superior performance relative to the state-of-the-art approaches.
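The difference between a plain minimum-energy assignment and a confidence-aware one can be illustrated with a small sketch. This is not the paper's energy layer: the abstract does not give its formulas, so the empirical p-value estimate, the combined objective `energy + weight * log(p)`, the `weight` parameter, and all function names and numbers below are assumptions chosen only to show the idea. Here a low p-value means a class's energy is unusually low compared with a reference ("null") set of energies for that class.

```python
import math


def min_energy_class(energies):
    """Index of the class with the lowest energy (plain energy minimization)."""
    return min(range(len(energies)), key=lambda c: energies[c])


def empirical_p_value(energy, null_energies):
    """Assumed p-value estimate: share of reference energies at least as low.

    The +1 terms keep the estimate strictly positive so log(p) is defined.
    """
    at_least_as_low = sum(1 for e in null_energies if e <= energy)
    return (at_least_as_low + 1) / (len(null_energies) + 1)


def most_confident_class(energies, null_energies_per_class, weight=1.0):
    """Pick the class minimizing energy + weight * log(p-value).

    Hypothetical objective: it rewards low energy and, through log(p),
    energies that are significantly low relative to the reference set.
    """
    def objective(c):
        p = empirical_p_value(energies[c], null_energies_per_class[c])
        return energies[c] + weight * math.log(p)

    return min(range(len(energies)), key=objective)


# Illustrative (made-up) energies for three activity classes.
energies = [-2.0, -1.9, 0.5]

# Made-up reference energies per class, e.g. collected on held-out data.
null_energies_per_class = [
    [-3.0, -2.5, -2.2, -2.1, -1.0],  # class 0: an energy of -2.0 is ordinary
    [-0.5, 0.0, 0.3, 0.8, 1.2],      # class 1: an energy of -1.9 is unusual
    [-1.0, 0.0, 0.4, 0.6, 1.0],      # class 2
]

print(min_energy_class(energies))
print(most_confident_class(energies, null_energies_per_class))
```

With these toy numbers, classes 0 and 1 have nearly equal energies, so the plain minimum selects class 0 by a small margin, while the confidence-aware objective selects class 1 because its energy is far more significant relative to its reference set. This is the kind of near-tie under uncertainty that the abstract describes as unstable for pure energy minimization.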