CV · Mar 27, 2025

Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

arXiv:2503.21782v1 · 6 citations · h-index: 14 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses inefficiencies in video understanding models for practical applications, representing an incremental improvement in efficiency and performance.

The paper tackles the high computational requirements and slow inference of video understanding models by proposing Mobile-VideoGPT, an efficient multimodal framework with fewer than a billion parameters. Mobile-VideoGPT-0.5B generates up to 46 tokens per second and outperforms existing 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2x higher throughput.

Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
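The abstract does not spell out how Attention-Based Frame Scoring works, so the following is only a minimal sketch of the general idea of ranking frames by attention and keeping the top k. It is not the authors' implementation. The function names, the frame-to-frame attention matrix, and the mean-attention-received scoring rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def frame_scores(attn: Sequence[Sequence[float]]) -> List[float]:
    """Score each frame by the mean attention it receives from all frames.

    ASSUMPTION: attn[i][j] is the attention weight frame i places on frame j.
    The paper may compute its scores differently.
    """
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def select_key_frames(attn: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    scores = frame_scores(attn)
    # Rank frame indices by score, highest first, and keep the first k.
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    # Restore temporal order so downstream modules see frames in sequence.
    return sorted(top)


# Toy 4-frame attention matrix (hypothetical values, each row sums to 1).
toy_attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.4, 0.1, 0.3],
    [0.1, 0.5, 0.2, 0.2],
]
print(select_key_frames(toy_attn, k=2))  # [1, 3]
```

In this toy input, frames 1 and 3 receive the most attention, so they are kept. Returning the indices in temporal order is a design choice of the sketch, made so that a later projector or language model would still see the selected frames in sequence.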

Code Implementations: 1 repo