CV, Jul 17, 2020

Region-based Non-local Operation for Video Classification

arXiv:2007.09033v5 (13 citations)
AI Analysis

This work improves long-range dependency modeling for video classification. The contribution is incremental, since it builds on existing attention mechanisms.

The paper tackles the difficulty of modeling long-range dependencies in CNNs by proposing region-based non-local operations as a self-attention mechanism, achieving state-of-the-art performance on the Something-Something V1 video classification dataset.

Convolutional Neural Networks (CNNs) model long-range dependencies by deeply stacking convolution operations with small window sizes, which makes optimization difficult. This paper presents region-based non-local (RNL) operations as a family of self-attention mechanisms, which can directly capture long-range dependencies without using a deep stack of local operations. Given an intermediate feature map, our method recalibrates the feature at a position by aggregating the information from the neighboring regions of all positions. By combining a channel attention module with the proposed RNL, we design an attention chain, which can be integrated into off-the-shelf CNNs for end-to-end training. We evaluate our method on two video classification benchmarks. Our method outperforms other attention mechanisms, and we achieve state-of-the-art performance on the Something-Something V1 dataset.
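
As a rough illustration of the idea in the abstract (recalibrating each position using the neighboring regions of all positions), here is a minimal NumPy sketch. It is not the authors' implementation. It makes several assumptions: a 2D feature map instead of video, a region summarized by plain average pooling over a k x k window, a scaled dot-product softmax as the affinity, and no learned weights or channel attention module. The names `region_summary` and `region_nonlocal` are hypothetical.

```python
import numpy as np


def region_summary(x, k=3):
    """Average each position's k x k neighborhood (edge-padded).

    x has shape (H, W, C). Average pooling is an assumption made here
    for illustration; it stands in for whatever region aggregation the
    paper actually uses.
    """
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w, _ = x.shape
    out = np.zeros_like(x, dtype=float)
    # Sum the k*k shifted views of the padded map, then normalize.
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + h, dx:dx + w, :]
    return out / (k * k)


def region_nonlocal(x, k=3):
    """Recalibrate every position from all positions, using region affinity.

    Returns the recalibrated map (H, W, C) and the attention weights
    (H*W, H*W). The affinity between positions i and j is computed from
    their region summaries rather than from the single features at i and j.
    """
    h, w, c = x.shape
    regions = region_summary(x, k).reshape(h * w, c)
    values = x.reshape(h * w, c).astype(float)

    # Scaled dot-product affinity between region summaries (assumed form).
    logits = regions @ regions.T / np.sqrt(c)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Residual connection: original feature plus the aggregated response.
    out = values + weights @ values
    return out.reshape(h, w, c), weights
```

Calling `region_nonlocal(np.random.rand(4, 5, 8))` returns a map of the same shape plus a 20 x 20 weight matrix whose rows sum to 1. With `k=1` the region summary reduces to the feature itself, so the sketch falls back to an ordinary position-wise non-local operation.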

Code Implementations: 1 repo