Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning
This work addresses the challenge of recognizing rare fine-grained actions from limited data, which matters for real-world applications that require specific action understanding. The contribution is incremental, however, since it builds on existing few-shot methods.
The paper tackles few-shot fine-grained action recognition by proposing a bidirectional attention module to capture subtle action details and contrastive meta-learning to handle low inter-class variance. It reports state-of-the-art performance under benchmark protocols that the authors establish on two large-scale fine-grained action recognition datasets.
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications, whereas data for rare fine-grained categories is very limited. Therefore, we propose the few-shot fine-grained action recognition problem, which aims to recognize novel fine-grained actions given only a few samples per class. Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions: the inability to capture subtle action details and the inadequacy in learning from data with low inter-class variance. To tackle the first issue, we propose a bidirectional attention module (BAM) inspired by human vision. Combining top-down task-driven signals with bottom-up salient stimuli, BAM captures subtle action details by accurately highlighting informative spatio-temporal regions. To address the second issue, we introduce contrastive meta-learning (CML). Compared with the widely adopted ProtoNet-based method, CML generates more discriminative video representations for low inter-class variance data, since it makes full use of the potential contrastive pairs in each training episode. Furthermore, to compare different models fairly, we establish specific benchmark protocols on two large-scale fine-grained action recognition datasets. Extensive experiments show that our method consistently achieves state-of-the-art performance across the evaluated tasks.
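To make the contrast between ProtoNet-based training and CML more concrete, the following is a minimal sketch rather than the paper's implementation. It assumes video embeddings are already computed, uses a generic supervised InfoNCE-style loss as a stand-in for CML, and all names (`prototype_loss`, `episode_contrastive_loss`, `temperature`) are hypothetical. The point it illustrates is that a prototype-based loss compares each query only with class means, while a contrastive loss can draw on every sample pair in the episode.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_softmax_at(scores, idx):
    """Numerically stable log-softmax of scores, evaluated at position idx."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - lse


def prototype_loss(support, support_labels, query, query_labels):
    """ProtoNet-style loss: each query is scored only against class means."""
    classes = sorted(set(support_labels))
    protos = []
    for c in classes:
        members = [e for e, l in zip(support, support_labels) if l == c]
        protos.append([sum(col) / len(members) for col in zip(*members)])
    losses = []
    for q, label in zip(query, query_labels):
        scores = [-sq_dist(q, p) for p in protos]
        losses.append(-log_softmax_at(scores, classes.index(label)))
    return sum(losses) / len(losses)


def episode_contrastive_loss(embeddings, labels, temperature=0.1):
    """Assumed stand-in for CML: supervised InfoNCE over all episode pairs."""
    n = len(embeddings)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        scores = [cosine(embeddings[i], embeddings[j]) / temperature for j in others]
        for j in positives:
            losses.append(-log_softmax_at(scores, others.index(j)))
    return sum(losses) / len(losses)


# Toy 2-way episode: 2 support samples and 1 query per class (made-up values).
support = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
query = [[1.0, 0.0], [0.0, 0.9]]
query_labels = [0, 1]

print("prototype loss:", prototype_loss(support, support_labels, query, query_labels))
print("contrastive loss:", episode_contrastive_loss(support + query,
                                                    support_labels + query_labels))
```

In this toy episode the prototype loss uses 2 queries x 2 prototypes = 4 comparisons, whereas the contrastive loss considers all 15 unordered pairs among the 6 samples. How CML actually constructs and weights its pairs is not specified in the abstract, so the pairing scheme here is only an assumption.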