CV, Apr 10, 2017

ActionVLAD: Learning spatio-temporal aggregation for action classification

arXiv:1704.02895v1, 468 citations
Originality: Incremental advance
AI Analysis

This work targets higher accuracy in video action classification and is an incremental advance over existing two-stream methods.

The paper tackles action classification in videos by introducing ActionVLAD, a representation that aggregates local convolutional features across space and time and is integrated with two-stream networks. It outperforms the base two-stream architecture by 13% relative, and it also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades benchmarks.

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) as well as out-performs other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
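
To make the idea of pooling jointly across space and time concrete, here is a minimal NumPy sketch of a VLAD-style aggregation over every spatial location of every frame. It is an illustration under stated assumptions, not the authors' implementation: the soft-assignment form, the sharpness parameter `alpha`, the function name `vlad_aggregate`, and the two normalization steps are assumptions modeled on common NetVLAD-style pooling, and the cluster `centers` are passed in as fixed arrays here, whereas the abstract describes the aggregation as learnable and trained end to end.

```python
import numpy as np


def vlad_aggregate(features, centers, alpha=1.0):
    """Pool local descriptors over time and space into one vector.

    features: array of shape (T, H, W, D), one D-dim descriptor per
              spatial location per frame (hypothetical layout).
    centers:  array of shape (K, D), the cluster centers ("action words").
    alpha:    assumed sharpness of the soft assignment.
    Returns a vector of length K * D with unit L2 norm.
    """
    d = features.shape[-1]
    x = features.reshape(-1, d)  # (N, D) with N = T * H * W

    # Soft-assign every descriptor to every center via a softmax over
    # negative scaled squared distances.
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -alpha * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)  # (N, K)

    # Sum assignment-weighted residuals (x_n - c_k) over all positions
    # and frames at once: this is the joint spatio-temporal pooling.
    vlad = assign.T @ x - assign.sum(axis=0)[:, None] * centers  # (K, D)

    # Assumed normalization: per-center L2, then global L2.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

Because the sum runs over all frames and positions together, the output has a fixed length regardless of video duration and does not depend on frame order. Following finding (ii) in the abstract, one would call such a function once on appearance features and once on motion features, keeping the two resulting vectors as separate representations rather than pooling both streams into one.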

