CV | Apr 13, 2025

A Survey on Efficient Vision-Language Models

arXiv:2504.09724v3 | 27 citations | h-index: 3 | Has Code | WIREs Data. Mining. Knowl. Discov.
Originality: Synthesis-oriented
AI Analysis

The paper tackles the problem of enabling real-time VLM applications on resource-constrained devices, but its contribution is incremental: it surveys existing methods rather than introducing new ones.

This survey addresses the high computational demands of vision-language models (VLMs) by reviewing key techniques for optimizing them on edge and resource-constrained devices, including compact architectures and performance-memory trade-offs. It also establishes a GitHub repository that compiles the surveyed resources to foster further research.

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, which makes them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.

Code Implementations: 1 repo
