CV | Jan 8, 2025

Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

arXiv:2501.04322v2 | 16 citations | h-index: 32 | Has Code | AAAI
AI Analysis

This work addresses the efficient deployment of vision language models on edge devices. It is an incremental improvement with specific, measurable gains.

The paper tackles the challenge of running multimodal vision language models on edge devices by introducing Eve, a framework that balances linguistic and multimodal capabilities. The resulting 1.8B-parameter model achieves 68.87% on VLM benchmarks, leads language benchmarks among models below 3B parameters, and exceeds the multimodal accuracy of the larger 7B LLaVA-1.5.

Multimodal vision language models (VLMs) have made significant progress, supported by continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. Several efficient VLM efforts exist, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, among configurations below 3B parameters, Eve outperforms other models on language benchmarks and achieves a state-of-the-art result of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy exceeds that of the larger 7B LLaVA-1.5 model. Our code is available at https://github.com/rangmiao/Eve.

Code Implementations: 1 repo
