CV | AI | Mar 17

Empirical Recipes for Efficient and Compact Vision-Language Models

arXiv:2603.16987 | 54.0 | h-index: 14
AI Analysis

This work addresses the need for faster and more efficient VLMs in resource-constrained settings, offering practical solutions that are broadly applicable across architectures and frameworks.

The paper tackles inefficient inference in compact vision-language models (VLMs): it profiles inference to identify the dominant bottlenecks and develops optimization recipes that cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M while preserving accuracy.

Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
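
To put the reported TTFT reductions in perspective, a percentage reduction can be converted into an equivalent speedup factor. The sketch below is plain arithmetic on the two figures quoted in the abstract. The helper name `speedup_from_reduction` is hypothetical and not from the paper, and it assumes each reduction is measured relative to the unoptimized baseline TTFT.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage latency reduction into a speedup factor.

    Assumes the reduction is relative to the baseline latency, so a
    reduction of r% leaves (1 - r/100) of the original time.
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# TTFT reductions as quoted in the abstract.
for model, pct in [("InternVL3-2B", 53), ("SmolVLM-256M", 93)]:
    factor = speedup_from_reduction(pct)
    print(f"{model}: {pct}% lower TTFT is about {factor:.1f}x faster")
```

Under that assumption, a 53% reduction corresponds to roughly a 2.1x faster first token, and a 93% reduction to roughly 14.3x.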

