CV · Sep 21, 2025

MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors

arXiv:2509.17084v2 · h-index: 3 · Has Code · 2025 18th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)
Originality: Incremental advance
AI Analysis

This work provides a highly efficient baseline for video understanding, bridging large static models with low-cost motion cues, but it is incremental as it combines existing components.

The paper tackles efficient video action recognition by fusing features from a frozen CLIP image encoder with features from a lightweight motion-vector (MV) network. It reaches 89.2% Top-1 accuracy on UCF101, compared with 65.0% for the zero-shot baseline and 66.5% for the MV-only baseline.

Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
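The late-fusion step described in the abstract can be illustrated with a minimal sketch of a forward pass through a small MLP head. It is not the authors' implementation. It assumes the two feature vectors are fused by concatenation, and the feature sizes (`CLIP_DIM`, `MV_DIM`), the hidden width (`HIDDEN_DIM`), and all function names are hypothetical. The features are random stand-ins for the outputs of the two frozen backbones, and the head is untrained, so the predicted class carries no meaning.

```python
import math
import random

NUM_CLASSES = 101   # UCF101 has 101 action classes
CLIP_DIM = 512      # assumed CLIP embedding size (not stated in the abstract)
MV_DIM = 256        # hypothetical motion-vector feature size
HIDDEN_DIM = 64     # hypothetical hidden width of the MLP head


def init_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(clip_feat, mv_feat, head):
    """Late fusion (assumed: concatenation) followed by a two-layer MLP head."""
    fused = list(clip_feat) + list(mv_feat)
    hidden = [max(0.0, h) for h in linear(fused, head["fc1"])]  # ReLU
    return softmax(linear(hidden, head["fc2"]))


rng = random.Random(0)

# Only these head parameters would be trained; both backbones stay frozen.
head = {
    "fc1": init_linear(CLIP_DIM + MV_DIM, HIDDEN_DIM, rng),
    "fc2": init_linear(HIDDEN_DIM, NUM_CLASSES, rng),
}

# Random stand-ins for features precomputed by the two frozen backbones.
clip_feat = [rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]
mv_feat = [rng.gauss(0.0, 1.0) for _ in range(MV_DIM)]

probs = fuse_and_classify(clip_feat, mv_feat, head)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Because both backbones are frozen, their features can be computed once per video and cached, so training only touches the parameters in `head`.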

