AI, May 27

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

arXiv:2605.28115 57.1
AI Analysis

This work targets the memory and latency bottlenecks that practitioners face when deploying high-resolution vision-language models (VLMs), and offers a practical way to turn theoretical FLOP savings into real wall-clock acceleration.

CIVIC introduces a path-consistent compact visual inference framework that maintains compact sequence representations across all VLM components, achieving genuine hardware efficiency by reducing KV-cache memory to ~1/3 of baseline and lowering end-to-end latency without accuracy loss on multimodal benchmarks.

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.
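To make the KV-cache claim concrete, the sketch below works through the arithmetic for a standard decoder-only transformer, where cache size grows linearly with the number of cached tokens. It is a back-of-the-envelope illustration, not CIVIC's implementation: the function name `kv_cache_bytes`, the layer and head counts, and the token counts are all hypothetical values chosen for the example, not figures from the paper or from Qwen3-VL.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    Assumes a standard decoder-only transformer in which every layer
    stores one key and one value vector per KV head per token.
    """
    # Factor 2: one tensor for keys, one for values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


# Hypothetical model configuration (NOT taken from the paper or Qwen3-VL).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)  # fp16

text_tokens = 128             # assumed prompt length
baseline_visual_tokens = 4096 # assumed uncompressed visual sequence
compact_visual_tokens = 1280  # assumed compact visual sequence

baseline = kv_cache_bytes(text_tokens + baseline_visual_tokens, **cfg)
compact = kv_cache_bytes(text_tokens + compact_visual_tokens, **cfg)

print(f"baseline: {baseline / 2**20:.0f} MiB")
print(f"compact:  {compact / 2**20:.0f} MiB")
print(f"ratio:    {compact / baseline:.3f}")
```

Because the per-token cost is fixed by the model configuration, the cache ratio equals the ratio of total cached tokens. Under these assumed numbers, keeping 1280 of 4096 visual tokens alongside 128 text tokens gives exactly one third of the baseline cache, which matches the order of reduction the abstract reports. The saving only materializes if the compact sequence is what actually gets written to the cache, which is the point of keeping the pathway compact end to end.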

