CV
Sep 30, 2022

Alignment-guided Temporal Attention for Video Action Recognition

arXiv:2210.00132v2
24 citations
h-index: 21
Originality: Highly original
AI Analysis

This work addresses a key bottleneck in video learning for computer vision applications, offering a general plug-in module that improves action recognition performance.

The paper tackles the trade-off between efficiency and performance in video action recognition by proposing Alignment-guided Temporal Attention (ATA), which uses patch-level alignments to enhance temporal modeling and achieves state-of-the-art results on multiple benchmarks.

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
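The idea of aligning patches between neighboring frames before applying 1-dimensional temporal attention can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a greedy cosine-similarity match (each patch is linked to its most similar patch in the next frame, which is not guaranteed to be one-to-one), and it uses attention with identity projections so that no learned weights are needed. All function names here (`align_to_next`, `build_tracks`, `temporal_attention`, `aligned_temporal_attention`) are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two patch feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def align_to_next(prev, curr):
    """For each patch in `prev`, index of its most similar patch in `curr`.

    Assumption: greedy argmax matching, parameter-free but not one-to-one.
    """
    return [max(range(len(curr)), key=lambda j: cosine(p, curr[j])) for p in prev]


def build_tracks(frames):
    """Chain frame-by-frame matches into tracks.

    frames[t][n] is the feature vector of patch n in frame t.
    Returns tracks where tracks[n][t] is the patch index in frame t
    reached by following matches from patch n of frame 0.
    """
    tracks = [[n] for n in range(len(frames[0]))]
    for t in range(1, len(frames)):
        match = align_to_next(frames[t - 1], frames[t])
        for track in tracks:
            track.append(match[track[-1]])
    return tracks


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(seq):
    """Single-head self-attention over time with identity projections."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        weights = softmax(scores)
        out.append(
            [sum(weights[t] * seq[t][i] for t in range(len(seq))) for i in range(dim)]
        )
    return out


def aligned_temporal_attention(frames):
    """Run temporal attention along aligned tracks instead of fixed positions.

    Returns (tracks, outputs) where outputs[n][t] is the attended feature
    for track n at frame t.
    """
    tracks = build_tracks(frames)
    outputs = []
    for track in tracks:
        seq = [frames[t][idx] for t, idx in enumerate(track)]
        outputs.append(temporal_attention(seq))
    return tracks, outputs


# Toy input: two frames whose two patches swap positions between frames.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
tracks, outputs = aligned_temporal_attention(frames)
```

In the toy input the two patches swap positions between frames, so attention at fixed positions would mix unrelated content, while attention along the tracks attends over matching content. Scattering the outputs back to their original positions would additionally require a one-to-one matching, which this sketch does not enforce.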
