CV · AI · Dec 9, 2021

DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition

arXiv:2112.04674v3 · 27 citations · Has Code
Originality: Highly original
AI Analysis

DualFormer addresses efficiency issues in video recognition, offering researchers and practitioners a method that balances local and global dependencies with reduced computational overhead.

The paper tackles the high computational cost of transformers in video recognition by proposing DualFormer, a local-global stratified transformer. It achieves 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with about 1000G inference FLOPs, at least 3.2x fewer than existing methods with similar performance.

While transformers have shown great potential on video recognition with their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention over the huge number of 3D tokens. In this paper, we present a new transformer architecture termed DualFormer, which can efficiently perform space-time attention for video recognition. Concretely, DualFormer stratifies the full space-time attention into dual cascaded levels, i.e., it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Unlike existing methods that apply space-time factorization or restrict attention computations to local windows to improve efficiency, our local-global stratification strategy captures both short- and long-range spatiotemporal dependencies well, while greatly reducing the number of keys and values in attention computation to boost efficiency. Experimental results verify the superiority of DualFormer on five video benchmarks against existing methods. In particular, DualFormer achieves 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with ~1000G inference FLOPs, which is at least 3.2x fewer than existing methods with similar performance. We have released the source code at https://github.com/sail-sg/dualformer.
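
To make the claim about fewer keys and values concrete, the following is a rough counting sketch, not the released implementation. It compares how many keys/values a single query attends to under full space-time attention versus a local window stage followed by a global pyramid stage. The function name `kv_per_query`, the token grid, the window size, and the pyramid levels are all hypothetical values chosen for illustration; they are not taken from the paper.

```python
from math import prod

def kv_per_query(grid, window, pyramid_levels):
    """Count the keys/values one query token attends to under each scheme.

    grid, window, and each pyramid level are (T, H, W) token shapes.
    All shapes used below are illustrative assumptions.
    """
    full = prod(grid)        # full space-time attention: every 3D token
    local = prod(window)     # local stage: tokens inside one 3D window
    # global stage: pooled context tokens summed over the pyramid levels
    global_ctx = sum(prod(level) for level in pyramid_levels)
    return {
        "full": full,
        "local": local,
        "global": global_ctx,
        "stratified": local + global_ctx,  # total over the two cascaded stages
    }

# Hypothetical sizes, chosen only to show the order of magnitude.
grid = (16, 56, 56)                                   # T x H x W tokens
window = (8, 7, 7)                                    # local 3D window
pyramid_levels = [(1, 1, 1), (2, 2, 2), (4, 4, 4)]    # pooled global contexts

counts = kv_per_query(grid, window, pyramid_levels)
reduction = counts["full"] / counts["stratified"]
print(counts)
print(f"keys/values per query reduced about {reduction:.0f}x")
```

The count ignores projection and feed-forward costs, so it does not translate directly into the FLOPs figures quoted in the abstract; it only shows why attending to a window plus a small set of pooled contexts is much cheaper than attending to every token.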

Code Implementations: 1 repo