CV AI LG | Sep 30, 2021

Unsupervised Few-Shot Action Recognition via Action-Appearance Aligned Meta-Adaptation

Jay Patravali, Gaurav Mittal, Ye Yu, Fuxin Li, Mei Chen

arXiv:2109.15317v29.424 citations

Originality: Incremental advance

AI Analysis

The paper addresses the cost of annotating videos for action recognition in computer vision. The advance is incremental because it builds on existing meta-learning and unsupervised techniques.

The paper tackles unsupervised few-shot action recognition by introducing MetaUVFS, a method that uses over 550K unlabeled videos and a novel alignment module to train without base-class labels, achieving competitive or superior performance compared to supervised state-of-the-art methods on benchmarks like HMDB51, UCF101, and Kinetics100.

We present MetaUVFS as the first Unsupervised Meta-learning algorithm for Video Few-Shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning to capture the appearance-specific spatial and action-specific spatio-temporal video features, respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adaptation (A3M) module that learns to focus on the action-oriented video features in relation to the appearance features via explicit few-shot episodic meta-learning over unsupervised hard-mined episodes. Our action-appearance alignment and explicit few-shot learner condition the unsupervised training to mimic the downstream few-shot task, enabling MetaUVFS to significantly outperform all unsupervised methods on few-shot benchmarks. Moreover, unlike previous few-shot action recognition methods that are supervised, MetaUVFS needs neither base-class labels nor a supervised pretrained backbone. Thus, we need to train MetaUVFS just once to perform competitively with, and sometimes even outperform, state-of-the-art supervised methods on the popular HMDB51, UCF101, and Kinetics100 few-shot datasets.
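
The abstract says the two streams are trained "via contrastive learning" but does not state the loss. As a rough illustration only, the sketch below implements a generic InfoNCE objective, a common choice for contrastive training; it is not taken from the paper. The function names, the temperature value, and the use of in-batch negatives are all assumptions, and plain Python lists stand in for the embeddings that a 2D or 3D CNN would produce.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper, not the paper's code).

    positives[i] is the positive for anchors[i], e.g. another augmented clip
    of the same video; every other positives[j] acts as an in-batch negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / len(a)


# Toy embeddings: two "videos", each with two views.
views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(views_a, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss(views_a, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each anchor matches its own positive and large when the pairs are swapped, which is the behavior a contrastive objective relies on to pull views of the same video together.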

