CV · Sep 28, 2019

Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

arXiv:1909.13130v1 · 170 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency in video analysis for action recognition, but it is incremental as it builds on prior decoupling methods.

The paper tackles the high computational cost of 3D CNNs for video action recognition by proposing a grouped spatial-temporal aggregation method that decomposes feature channels into spatial and temporal groups, achieving parameter efficiency and enabling quantitative analysis of feature contributions.

Temporal reasoning is an important aspect of video analysis. 3D CNNs perform well by exploring spatial-temporal features jointly in an unconstrained way, but they also substantially increase the computational cost. Previous works try to reduce the complexity by decoupling the spatial and temporal filters. In this paper, we propose a novel decomposition method that splits the feature channels into spatial and temporal groups in parallel, so that the two groups focus on static and dynamic cues separately. We call this grouped spatial-temporal aggregation (GST). This decomposition is more parameter-efficient and enables us to quantitatively analyze the contributions of spatial and temporal features in different layers. We verify our model on several action recognition tasks that require temporal reasoning and show its effectiveness.
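To make the parameter-efficiency claim concrete, the following is a rough sketch of how weight counts might compare between a full 3D convolution and a channel-grouped layer. It is an illustration under stated assumptions, not the paper's implementation: it assumes the spatial group uses a 1x3x3 (2D-style) kernel, the temporal group uses a 3x3x3 kernel, each group sees only its own share of the input channels, and biases are ignored. The function names and the `temporal_fraction` parameter are hypothetical.

```python
def conv3d_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Weight count of a dense 3D convolution (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def grouped_st_params(c_in, c_out, temporal_fraction=0.5, k=3):
    """Weight counts for a hypothetical grouped spatial-temporal layer.

    Assumption: channels are split into two parallel groups. The spatial
    group applies a 1 x k x k kernel, the temporal group a k x k x k kernel,
    and each group only reads its own slice of the input channels.
    """
    c_in_t = int(c_in * temporal_fraction)
    c_out_t = int(c_out * temporal_fraction)
    c_in_s = c_in - c_in_t
    c_out_s = c_out - c_out_t

    spatial = c_in_s * c_out_s * 1 * k * k   # static cues, no temporal extent
    temporal = c_in_t * c_out_t * k * k * k  # dynamic cues, full 3D kernel
    return spatial, temporal


full = conv3d_params(64, 64)
spatial, temporal = grouped_st_params(64, 64, temporal_fraction=0.5)
grouped = spatial + temporal

print(f"full 3D conv: {full} weights")
print(f"grouped:      {grouped} weights (spatial {spatial}, temporal {temporal})")
print(f"reduction:    {full / grouped:.1f}x")
```

Under these assumptions, an even split of a 64-to-64-channel layer needs about a third of the weights of the dense 3D convolution. Changing `temporal_fraction` shows how the cost shifts as more channels are assigned to the temporal group.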

Code Implementations: 1 repo
