CV · AI · Feb 6, 2024

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

arXiv:2402.03766v1 · 179 citations · h-index: 29 · Has Code
Originality: Incremental advance
AI Analysis

The work provides more efficient VLMs for mobile applications, but the advance is incremental because it builds on prior work such as MobileVLM.

The paper aims to improve vision language models (VLMs) for mobile devices by introducing MobileVLM V2, a model family that matches or outperforms much larger models: the 1.7B model achieves better or on-par performance compared with 3B-scale VLMs, and the 3B model outperforms a large variety of VLMs at the 7B+ scale.

We introduce MobileVLM V2, a family of significantly improved vision language models upon MobileVLM, which proves that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM .

Code Implementations: 1 repo
