CV, Apr 18

Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition

arXiv:2604.17062, 20.4, h-index: 3
AI Analysis

For video action recognition researchers, it provides a novel method to improve zero-shot generalization by explicitly separating motion and static cues and leveraging negative prompts.

The paper addresses zero-shot action recognition by enhancing CLIP with disentangled motion and static features and using negative prompts to model non-class semantics, achieving state-of-the-art results on multiple benchmarks.

Zero-shot action recognition is challenging due to the semantic gap between seen and unseen classes. We present a novel framework that enhances CLIP with disentangled embeddings and semantic-guided interaction. A Motion Separation Module (MSM) separates motion-sensitive and global-static features, while a Motion Aggregation Block (MAB) employs gated cross-attention to refine motion representation without re-coupling redundant information. To facilitate generalization to unseen categories, we enforce semantic alignment between video features and textual representations by aligning projected embeddings with positive textual prompts, while leveraging negative prompts to explicitly model "non-class" semantics. Experiments on standard benchmarks demonstrate that our method consistently outperforms prior CLIP-based approaches, achieving robust zero-shot action recognition across both coarse and fine-grained datasets.
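The abstract does not spell out how positive and negative prompts are combined at inference, so the following is only a minimal sketch of one plausible reading: each class gets a positive prompt embedding and a negative ("non-class") prompt embedding, and the class score is the similarity to the positive prompt minus a weighted similarity to the negative one. The function names, the `neg_weight` and `temperature` parameters, and the toy 3-dimensional vectors are all assumptions made for illustration; they are not taken from the paper, and real embeddings would come from CLIP's video and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits, temperature=0.07):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(video_emb, pos_embs, neg_embs, neg_weight=0.5):
    """Score each class: reward similarity to its positive prompt and
    penalize similarity to its negative prompt (hypothetical rule)."""
    logits = [
        cosine(video_emb, pos) - neg_weight * cosine(video_emb, neg)
        for pos, neg in zip(pos_embs, neg_embs)
    ]
    return softmax(logits)

# Toy stand-ins for encoder outputs (assumed values, two classes).
video = [0.9, 0.1, 0.0]
pos_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
neg_prompts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

probs = zero_shot_scores(video, pos_prompts, neg_prompts)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

In this toy setup the video vector lies close to the first class's positive prompt and far from its negative prompt, so the first class receives the higher probability. A trained model would presumably learn the alignment rather than use a fixed subtraction.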

