CV · Mar 3, 2024

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

arXiv:2403.01560v2 · 24 citations · h-index: 24 · Has Code
AI Analysis

This paper addresses cross-domain generalization in video action recognition. The contribution is incremental, since it builds on existing CLIP-based methods.

The paper tackles the problem of CLIP-based video learners struggling to generalize to unseen video domains in open-vocabulary action recognition, establishing a benchmark and proposing a scene-aware alignment method that improves performance, though specific numerical gains are not detailed in the abstract.

Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to distinguish video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
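The abstract states the key idea only at a high level, so the following is a rough sketch of one way such an objective could look, not the paper's actual loss. It combines a standard CLIP-style video-to-action-text alignment term with a hypothetical penalty that discourages video embeddings from being similar to scene text embeddings. All function names, the `scene_weight` parameter, the temperature value, and the form of the penalty are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax_cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def action_alignment_loss(video, action_texts, label, temperature=0.07):
    """CLIP-style term: pull the video toward its action text embedding."""
    logits = [cosine(video, t) / temperature for t in action_texts]
    return softmax_cross_entropy(logits, label)


def scene_separation_penalty(video, scene_texts):
    """Hypothetical term (an assumption, not from the paper): the mean
    positive cosine similarity between the video and scene text embeddings."""
    sims = [max(0.0, cosine(video, s)) for s in scene_texts]
    return sum(sims) / len(sims)


def total_loss(video, action_texts, label, scene_texts, scene_weight=0.5):
    """Alignment loss plus the weighted scene penalty (weight is assumed)."""
    align = action_alignment_loss(video, action_texts, label)
    scene = scene_separation_penalty(video, scene_texts)
    return align + scene_weight * scene


# Toy embeddings: two action prompts and one scene prompt on separate axes.
action_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scene_texts = [[0.0, 0.0, 1.0]]
scene_free_video = [1.0, 0.0, 0.0]    # encodes only the action
scene_biased_video = [1.0, 0.0, 1.0]  # also encodes the scene

print(total_loss(scene_free_video, action_texts, 0, scene_texts))
print(total_loss(scene_biased_video, action_texts, 0, scene_texts))
```

In this toy setup the scene-biased embedding receives the larger loss, which is the behaviour a scene-agnostic objective would be expected to encourage. The released code at the repository above is the authoritative reference for the real method.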

Code Implementations: 1 repo
