Directional Temporal Modeling for Action Recognition
This work addresses the need for stronger temporal modeling in action recognition, offering a lightweight module that can be attached to existing networks.
The paper tackles the absence of clip-level ordered temporal information in action recognition features by introducing a channel independent directional convolution (CIDC) operation, and reports consistent improvements over state-of-the-art techniques on four popular datasets.
Many current activity recognition models use 3D convolutional neural networks (e.g. I3D, I3D-NL) to generate local spatial-temporal features. However, such features do not encode clip-level ordered temporal information. In this paper, we introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features. By applying multiple CIDC units we construct a light-weight network that models the clip-level temporal evolution across multiple spatial scales. Our CIDC network can be attached to any activity recognition backbone network. We evaluate our method on four popular activity recognition datasets and consistently improve upon state-of-the-art techniques. We further visualize the activation map of our CIDC network and show that it is able to focus on more meaningful, action related parts of the frame.
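To make the idea of "channel independent" and "directional" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact form of CIDC, so the sketch assumes that "directional" means an output time step only sees input steps at or before it, and that "channel independent" means each channel has its own temporal weights with no mixing across channels. The function name `directional_temporal_mix` and the data layout are hypothetical.

```python
def directional_temporal_mix(features, weights):
    """Mix time steps within each channel, in one temporal direction only.

    Assumed layout (hypothetical, for illustration):
      features[c][t]   -> flat list of spatial activations, channel c, step t
      weights[c][t][s] -> weight from input step s to output step t, channel c
    """
    out = []
    # Channel independence: each channel uses only its own features and weights.
    for chan_feats, chan_weights in zip(features, weights):
        steps = len(chan_feats)
        size = len(chan_feats[0])
        chan_out = []
        for t in range(steps):
            mixed = [0.0] * size
            # Directional constraint: only steps s <= t contribute to output t,
            # so weights[c][t][s] with s > t are never read.
            for s in range(t + 1):
                w = chan_weights[t][s]
                for i in range(size):
                    mixed[i] += w * chan_feats[s][i]
            chan_out.append(mixed)
        out.append(chan_out)
    return out


# Two channels, three time steps, two spatial positions per step.
features = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]
# All-ones weights turn the operation into a running sum over time.
weights = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
result = directional_temporal_mix(features, weights)
```

With all-ones weights, each output step is the running sum of the steps seen so far, which makes the ordering explicit: changing a later frame cannot alter an earlier output. In a trained module the weights would be learned, and the operation would normally be written with a tensor library rather than nested loops.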