MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification
This work addresses video classification with limited labeled data, a problem relevant to researchers and practitioners in computer vision, and represents an incremental improvement over existing methods.
The paper tackles few-shot video classification by proposing MASTAF, a model-agnostic network that uses attention mechanisms to enhance spatio-temporal representations, achieving state-of-the-art performance with up to 91.6%, 69.5%, and 60.7% accuracy on three benchmarks for five-way one-shot tasks.
We propose MASTAF, a Model-Agnostic Spatio-Temporal Attention Fusion network for few-shot video classification. MASTAF takes as input a general spatial and temporal video representation, e.g., one produced by a 2D CNN, a 3D CNN, or a Video Transformer. To make the most of such representations, we use self- and cross-attention models to highlight the critical spatio-temporal regions, increasing inter-class variation and decreasing intra-class variation. Finally, MASTAF applies a lightweight fusion network and a nearest-neighbor classifier to classify each query video. We demonstrate that MASTAF improves on the state-of-the-art performance on three few-shot video classification benchmarks (UCF101, HMDB51, and Something-Something-V2), e.g., reaching up to 91.6%, 69.5%, and 60.7% accuracy, respectively, for five-way one-shot video classification.
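To make the pipeline described above concrete, the following is a minimal NumPy sketch of one five-way one-shot episode: self-attention over each video's tokens, cross-attention between query and support, a fusion step, and a nearest-neighbor decision. It is an illustration under stated assumptions, not the architecture of MASTAF itself: the attention has no learned projections, the fusion (mean pooling, concatenation, L2 normalization) and the cosine similarity are stand-ins chosen for brevity, and the names `attention`, `fuse`, and `classify_query` are hypothetical. The token features are assumed to come from any backbone (2D CNN, 3D CNN, or Video Transformer) as an array of shape (tokens, dim).

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention: q is (n, d), k and v are (m, d); returns (n, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def fuse(self_feat, cross_feat):
    """Assumed stand-in for a lightweight fusion network:
    mean-pool both token sets, concatenate, and L2-normalize."""
    pooled = np.concatenate([self_feat.mean(axis=0), cross_feat.mean(axis=0)])
    return pooled / (np.linalg.norm(pooled) + 1e-8)


def classify_query(query, supports):
    """Nearest-neighbor classification of one query video.

    query:    (T, d) backbone tokens for the query video.
    supports: dict mapping label -> (T, d) tokens (one shot per class).
    Returns the best label and its cosine similarity.
    """
    q_self = attention(query, query, query)  # self-attention on the query
    best_label, best_sim = None, -np.inf
    for label, support in supports.items():
        s_self = attention(support, support, support)  # self-attention on the support
        q_cross = attention(q_self, s_self, s_self)    # query attends to support
        s_cross = attention(s_self, q_self, q_self)    # support attends to query
        sim = float(fuse(q_self, q_cross) @ fuse(s_self, s_cross))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim
```

In a real system the attention and fusion modules would carry learned parameters trained episodically; the sketch only shows how the stages connect and where the nearest-neighbor decision is made.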