CV · Aug 9, 2023

Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning

arXiv:2308.04828v1 · 33 citations · h-index: 52
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient and generalized action recognition for video analysis, representing an incremental improvement by adapting CLIP with motion-aware prompts.

The paper adapts CLIP for action recognition by explicitly modeling motion cues in videos. The resulting method outperforms most state-of-the-art approaches in few-shot and zero-shot settings on HMDB-51, UCF-101, and Kinetics-400, and achieves competitive closed-set performance with few trainable parameters and little additional computational cost.

Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization in "zero-shot" settings and has been applied to many downstream tasks. We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method. We propose that the key lies in explicitly modeling the motion cues flowing in video frames. To that end, we design a two-stream motion modeling block that captures motion and spatial information at the same time. The obtained motion cues then drive a dynamic prompts learner to generate motion-aware prompts, which contain rich semantic information about human actions. In addition, we propose a multimodal communication block to enable collaborative learning and further improve performance. We conduct extensive experiments on the HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin in "few-shot" and "zero-shot" training. We also achieve competitive performance in "closed-set" training with extremely few trainable parameters and little additional computational cost.
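The idea of motion cues driving prompt generation can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes that motion is approximated by differences between consecutive per-frame embeddings, that the prompts are a set of shared context vectors shifted by a linear projection of the pooled motion cue, and that classification follows the usual CLIP-style cosine-similarity matching. All function names, shapes, and the projection matrix `w` are hypothetical.

```python
import numpy as np

def motion_cues(frames):
    """Approximate motion as differences between consecutive frame embeddings.

    frames: (T, D) array of per-frame features (assumed input format).
    Returns a (T-1, D) array.
    """
    return frames[1:] - frames[:-1]

def motion_aware_prompts(frames, context, w):
    """Shift shared context vectors by a projection of the pooled motion cue.

    context: (N, D) prompt vectors; w: (D, D) projection (both hypothetical).
    Returns (N, D) prompts conditioned on the video's motion.
    """
    pooled = motion_cues(frames).mean(axis=0)  # (D,) average motion cue
    offset = pooled @ w                        # (D,) projected cue
    return context + offset                    # broadcast over the N vectors

def classify(video_feat, text_feats):
    """CLIP-style matching: softmax over cosine similarities.

    video_feat: (D,) video embedding; text_feats: (C, D) class embeddings.
    Returns (C,) class probabilities (no temperature scaling in this sketch).
    """
    v = video_feat / np.linalg.norm(video_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = t @ v
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()
```

In a real system the prompts would pass through CLIP's text encoder before matching, and the projection would be learned; both steps are omitted here to keep the sketch self-contained.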

