CV | Apr 13, 2025

A Survey on Efficient Vision-Language Models

Gaurav Shinde, Anuradha Ravi, Emon Dey, Shadman Sakib, Milind Rampure, Nirmalya Roy

arXiv:2504.09724v321.127 citations | h-index: 28 | Has Code | WIREs Data. Mining. Knowl. Discov.

Originality: Synthesis-oriented

AI Analysis

The survey tackles the problem of enabling real-time VLM applications on resource-constrained devices, but its contribution is incremental: it surveys existing methods rather than introducing new ones.

This survey addresses the high computational demands of vision-language models (VLMs) by reviewing key techniques for optimizing them on edge and resource-constrained devices, including compact architectures and performance-memory trade-offs. It also establishes a GitHub repository that compiles resources to foster further research.

Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision-language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures and frameworks, and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at https://github.com/MPSCUMBC/Efficient-Vision-Language-Models-A-Survey to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
