CV · Aug 10, 2025

MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

arXiv:2508.07312v1 · 1 citation · h-index: 2 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient video-text models on mobile devices, representing an incremental improvement by adapting existing techniques to a new domain.

The paper tackles the high latency of video-text models by introducing MobileViCLIP, an efficient model for mobile devices. Its Small variant runs 55.4x faster than InternVideo2-L14 on mobile hardware while matching its zero-shot retrieval performance on MSR-VTT, and it outperforms InternVideo2-S14 on the same benchmark.

Efficient lightweight neural networks are receiving increasing attention because of their faster inference and easier deployment on mobile devices. However, existing pre-trained video models still rely on the common ViT architecture, which has high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges the gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale, high-quality video-text dataset. The result, termed MobileViCLIP, is an efficient video-text model that runs on mobile devices with strong zero-shot classification and retrieval capabilities. In terms of inference speed on mobile devices, MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval on MSR-VTT, MobileViCLIP-Small performs similarly to InternVideo2-L14 and 6.9% better than InternVideo2-S14. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
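
The abstract does not spell out how temporal structural reparameterization works, so the following is only a generic sketch of the underlying idea, not the paper's actual block design. It assumes a hypothetical training-time layer with three parallel branches over the time axis (a 3-tap convolution, a 1-tap convolution, and an identity skip) and shows that they can be folded into a single 3-tap kernel for inference with identical outputs. All function names, the branch layout, and the single-channel toy data are assumptions made for illustration.

```python
import random

def temporal_conv(x, kernel):
    """1D cross-correlation over time with zero 'same' padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]

def merge_branches(k3, k1):
    """Fold a 3-tap branch, a 1-tap branch, and an identity skip into one 3-tap kernel.

    Hypothetical branch layout; the 1-tap kernel and the identity (weight 1.0)
    both act only on the centre tap, so they are added there.
    """
    merged = list(k3)
    merged[1] += k1[0] + 1.0
    return merged

random.seed(0)
x = [random.gauss(0, 1) for _ in range(8)]    # one feature channel over 8 frames
k3 = [random.gauss(0, 1) for _ in range(3)]   # training-time 3-tap branch
k1 = [random.gauss(0, 1) for _ in range(1)]   # training-time 1-tap branch

# Training-time form: three parallel branches summed.
train_out = [
    a + b + c
    for a, b, c in zip(temporal_conv(x, k3), temporal_conv(x, k1), x)
]

# Inference-time form: a single convolution with the merged kernel.
infer_out = temporal_conv(x, merge_branches(k3, k1))

max_abs_diff = max(abs(a - b) for a, b in zip(train_out, infer_out))
```

Because convolution is linear, the merged kernel reproduces the multi-branch output up to floating-point rounding, which is why this kind of reparameterization can reduce inference cost without changing the trained function.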

