CV · Mar 11, 2025

Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition

arXiv:2503.14430v1 · 3 citations · h-index: 3 · Computer Vision and Image Understanding
Originality: Incremental advance
AI Analysis

This work addresses the challenge of recognizing actions from limited examples in computer vision, offering a domain-specific improvement for few-shot learning.

The paper targets two weaknesses of image-level few-shot action recognition: background noise and insufficient focus on action-related instances. It proposes a joint image-instance spatial-temporal attention approach and reports state-of-the-art results on benchmarks such as SSv2-Full and Kinetics, with gains of up to 2.3%.

Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods utilize mainly image-level features that incorporate background noise and focus insufficiently on real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) for Few-shot Action Recognition. The core concept of I2ST is to perceive the action-related instances and integrate them with image features via spatial-temporal attention. Specifically, I2ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial-temporal Attention is used to construct the feature dependency between instances and images...
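As a rough illustration of the pipeline the abstract describes, the sketch below fuses image-level and instance-level tokens with one self-attention pass, pools them into a video feature, and classifies a query by its nearest class prototype. It is not the authors' I2ST implementation. The function names, the single-head attention without learned projections, the mean pooling, and the Euclidean nearest-prototype rule are all assumptions made for illustration, and the instance tokens are assumed to come from an upstream segmentation step that is not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(image_tokens, instance_tokens):
    """Hypothetical joint image-instance attention (illustrative only).

    image_tokens:    (N_img, D) array, e.g. frame patches across time.
    instance_tokens: (N_ins, D) array, e.g. pooled features of detected instances.
    Returns a single (D,) video-level feature.
    """
    # Put both token sets in one sequence so each token can attend to all others.
    tokens = np.concatenate([image_tokens, instance_tokens], axis=0)  # (N, D)
    d = tokens.shape[-1]
    # Scaled dot-product attention; no learned projections (an assumption).
    weights = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)  # (N, N)
    attended = weights @ tokens  # (N, D)
    # Residual connection followed by mean pooling over all tokens.
    return (tokens + attended).mean(axis=0)


def class_prototypes(support_feats, support_labels, n_way):
    """Average the support features of each class into one prototype."""
    support_feats = np.asarray(support_feats, dtype=float)
    support_labels = np.asarray(support_labels)
    return np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in range(n_way)]
    )


def classify(query_feat, prototypes):
    """Return the index of the prototype closest to the query (Euclidean)."""
    dists = np.linalg.norm(prototypes - np.asarray(query_feat, dtype=float), axis=1)
    return int(np.argmin(dists))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 2-way 1-shot episode with random stand-in features.
    support = [
        joint_attention(rng.normal(c, 1.0, (16, 8)), rng.normal(c, 1.0, (4, 8)))
        for c in (0.0, 5.0)
    ]
    protos = class_prototypes(support, [0, 1], n_way=2)
    query = joint_attention(rng.normal(5.0, 1.0, (16, 8)), rng.normal(5.0, 1.0, (4, 8)))
    print("predicted class:", classify(query, protos))
```

Concatenating the two token sets before attention is only one plausible way to build dependencies between instances and images; separate cross-attention in each direction would be another.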

