CV · Feb 10, 2021

AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition

arXiv:2102.05775v1 · 70 citations
Originality: Incremental advance
AI Analysis

This paper addresses computational inefficiency in video action recognition, a concern for both researchers and practitioners. It offers an incremental improvement by optimizing temporal modeling.

The paper tackles efficient video action recognition by introducing AdaFuse, an adaptive temporal fusion network that dynamically fuses channels from current and past feature maps. It reports about 40% computation savings with accuracy comparable to state-of-the-art methods on Something V1 & V2, Jester, and Mini-Kinetics.

Temporal modelling is the key for efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with current pruned feature maps with the goal of improving both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something V1 & V2, Jester and Mini-Kinetics show that our approach can achieve about 40% computation savings with comparable accuracy to state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AdaFuse/
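The channel-wise fusion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the three decision labels (`KEEP`, `REUSE`, `SKIP`), the function name `fuse_channels`, and the `compute_channel` callback are all hypothetical names chosen for illustration, and the per-channel policy is supplied by hand here rather than predicted by a network. The sketch assumes that a channel marked "keep" is computed for the current frame, a channel marked "reuse" is copied from the previous frame's feature map, and a channel marked "skip" is zeroed without any computation.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical per-channel decisions (labels are assumptions, not from the paper's code).
KEEP, REUSE, SKIP = "keep", "reuse", "skip"

Channel = List[float]  # one flattened feature-map channel


def fuse_channels(
    compute_channel: Callable[[int], Channel],
    past: Sequence[Channel],
    policy: Sequence[str],
) -> Tuple[List[Channel], float]:
    """Fuse current and past feature maps channel by channel.

    compute_channel(c) stands in for the convolution that produces channel c
    of the current frame; it is only called for channels marked KEEP.
    Returns the fused feature map and the fraction of channels whose
    computation was avoided.
    """
    if len(past) != len(policy):
        raise ValueError("policy must have one decision per channel")
    if not policy:
        return [], 0.0

    fused: List[Channel] = []
    computed = 0
    for c, decision in enumerate(policy):
        if decision == KEEP:
            fused.append(compute_channel(c))      # pay for this channel
            computed += 1
        elif decision == REUSE:
            fused.append(list(past[c]))           # copy the historical channel
        elif decision == SKIP:
            fused.append([0.0] * len(past[c]))    # zero-fill, no computation
        else:
            raise ValueError(f"unknown decision: {decision!r}")

    saved_fraction = 1.0 - computed / len(policy)
    return fused, saved_fraction
```

In this toy accounting, a policy that keeps three of five channels avoids two fifths of the per-channel work. That figure is only an illustration of how savings arise from reuse and skipping, not a reproduction of the reported result.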

Code Implementations: 1 repo
