Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning
This work addresses the problem of efficient and generalized action recognition for video analysis, representing an incremental improvement by adapting CLIP with motion-aware prompts.
The paper adapts CLIP for action recognition by explicitly modeling motion cues in videos. The resulting method outperforms most state-of-the-art approaches in few-shot and zero-shot settings on the HMDB-51, UCF-101, and Kinetics-400 datasets, and achieves competitive closed-set performance with very few trainable parameters and little additional computational cost.
Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization in "zero-shot" settings and has been applied to many downstream tasks. We explore adapting CLIP to obtain a more efficient and generalized action recognition method. We propose that the key lies in explicitly modeling the motion cues flowing through video frames. To that end, we design a two-stream motion modeling block that captures motion and spatial information simultaneously. The obtained motion cues then drive a dynamic prompts learner to generate motion-aware prompts, which carry rich semantic information about human actions. In addition, we propose a multimodal communication block to enable collaborative learning and further improve performance. We conduct extensive experiments on the HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin in "few-shot" and "zero-shot" training. We also achieve competitive performance in "closed-set" training with extremely few trainable parameters and little additional computational cost.
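The idea of motion cues driving a prompt learner can be illustrated with a toy sketch. The abstract does not specify the internals of the two-stream block or the dynamic prompts learner, so everything below is an assumption made for illustration: simple differencing of consecutive frame features stands in for the motion stream, mean pooling stands in for temporal aggregation, and an additive shift stands in for prompt conditioning. The function names `frame_differences`, `mean_pool`, and `motion_aware_prompt`, as well as the `scale` parameter, are hypothetical and not taken from the paper.

```python
# Toy sketch only: NOT the authors' architecture. Frame differencing and
# additive conditioning are assumed stand-ins for the two-stream motion
# modeling block and the dynamic prompts learner.

def frame_differences(frames):
    """Return differences between consecutive per-frame feature vectors."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def motion_aware_prompt(static_prompt, motion_cue, scale=0.1):
    """Shift every prompt token by a scaled motion cue (hypothetical conditioning)."""
    return [
        [p + scale * m for p, m in zip(token, motion_cue)]
        for token in static_prompt
    ]


# Three frames, each described by a 2-dimensional feature vector (made-up values).
frames = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
# One prompt token with the same dimensionality (made-up values).
static_prompt = [[1.0, 1.0]]

motion_cue = mean_pool(frame_differences(frames))
prompt = motion_aware_prompt(static_prompt, motion_cue, scale=2.0)
```

In this sketch the prompt depends on the input video rather than being fixed per class, which is the property the abstract attributes to motion-aware prompts. A real system would use learned projections in place of the fixed `scale`.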