Region-based Non-local Operation for Video Classification
This work targets long-range dependency modeling for video classification, though the contribution is incremental because it builds on existing attention mechanisms.
The paper tackles the difficulty of modeling long-range dependencies in CNNs by proposing region-based non-local operations as a self-attention mechanism, achieving state-of-the-art performance on the Something-Something V1 video classification dataset.
Convolutional Neural Networks (CNNs) model long-range dependencies by deeply stacking convolution operations with small window sizes, which makes optimization difficult. This paper presents region-based non-local (RNL) operations as a family of self-attention mechanisms that can directly capture long-range dependencies without a deep stack of local operations. Given an intermediate feature map, our method recalibrates the feature at each position by aggregating information from the neighboring regions of all positions. By combining a channel attention module with the proposed RNL, we design an attention chain that can be integrated into off-the-shelf CNNs for end-to-end training. We evaluate our method on two video classification benchmarks. In our experiments, it outperforms other attention mechanisms and achieves state-of-the-art performance on the Something-Something V1 dataset.
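The recalibration step described in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the paper's implementation, and the abstract does not specify the details assumed here: a mean over a fixed window as the region summary, dot-product similarity with softmax normalization, a residual connection, and the names `region_mean`, `softmax`, `region_non_local`, and `radius`. The channel attention module and the real spatio-temporal layout of video features are left out.

```python
import math

def region_mean(x, i, radius):
    """Summarize the region around position i by averaging the feature
    vectors in [i - radius, i + radius], clipped to the sequence bounds.
    (Assumed aggregation; a learned operator could be used instead.)"""
    lo, hi = max(0, i - radius), min(len(x), i + radius + 1)
    dim = len(x[0])
    return [sum(x[j][c] for j in range(lo, hi)) / (hi - lo) for c in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def region_non_local(x, radius=1):
    """x: list of positions, each a list of channel values.
    Returns features recalibrated by attending over the region
    summaries of all positions."""
    regions = [region_mean(x, i, radius) for i in range(len(x))]
    dim = len(x[0])
    out = []
    for i, r_i in enumerate(regions):
        # Relation between the region at i and the region at every position j.
        scores = [sum(a * b for a, b in zip(r_i, r_j)) for r_j in regions]
        weights = softmax(scores)
        # Aggregate region information from all positions.
        agg = [sum(w * regions[j][c] for j, w in enumerate(weights)) for c in range(dim)]
        # Residual connection (assumed) keeps the original feature.
        out.append([xi + a for xi, a in zip(x[i], agg)])
    return out
```

With `radius=0` each region collapses to a single position, so the sketch reduces to a plain position-wise non-local operation. Larger radii make the attention weights depend on neighborhoods rather than individual positions, which is the distinction the abstract draws.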