CVAug 12, 2024

OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning

arXiv:2408.06158v118 citationsh-index: 8Has Code
Originality Incremental advance
AI Analysis

This addresses the limitation of CLIP in capturing spatial-temporal features for video recognition, offering improvements in supervised, few-shot, and zero-shot tasks, though it is incremental as it builds on existing CLIP frameworks.

The paper tackles the problem of adapting CLIP for video recognition by learning spatial-temporal omni-scale features, achieving a top-1 accuracy of 74.30% on HMDB51 in a 16-shot setting and surpassing MotionPrompt with full training data.

Recent Vision-Language Models (VLMs) \textit{e.g.} CLIP have made great progress in video recognition. Despite the improvement brought by the strong visual backbone in extracting spatial features, CLIP still falls short in capturing and integrating spatial-temporal features which is essential for video recognition. In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features. This is achieved through the design of spatial-temporal blocks that include parallel temporal adapters (PTA), enabling efficient temporal modeling. Additionally, we introduce a self-prompt generator (SPG) module to capture dynamic object spatial features. The synergy between PTA and SPG allows OmniCLIP to discern varying spatial information across frames and assess object scales over time. We have conducted extensive experiments in supervised video recognition, few-shot video recognition, and zero-shot recognition tasks. The results demonstrate the effectiveness of our method, especially with OmniCLIP achieving a top-1 accuracy of 74.30\% on HMDB51 in a 16-shot setting, surpassing the recent MotionPrompt approach even with full training data. The code is available at \url{https://github.com/XiaoBuL/OmniCLIP}.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes