CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
For practitioners deploying high-resolution vision-language models (VLMs), this work targets the memory and latency bottlenecks of long visual token sequences and turns theoretical FLOP savings into actual wall-clock acceleration.
CIVIC is a path-consistent compact visual inference framework that keeps the visual sequence compact across all VLM components, reducing KV-cache memory to roughly one-third of the baseline and lowering end-to-end latency without accuracy loss on multimodal benchmarks.
Vision-Language Models (VLMs) face severe memory and latency bottlenecks because high-resolution inputs produce long sequences of visual tokens. Current token reduction methods save FLOPs in theory, but post-hoc pruning introduces structural overhead and fails to yield proportional wall-clock acceleration. The alternative, enforcing a contiguous compact pathway, risks geometric disorientation and loss of fine-grained localization. To overcome both problems, this paper introduces CIVIC, a path-consistent compact visual inference framework. By keeping sequence representations compact across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC translates sequence reductions into hardware-level efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. With text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these gains without degrading accuracy on multimodal reasoning and visual grounding benchmarks.
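The link between sequence length and KV-cache memory can be illustrated with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: the layer count, KV-head count, head dimension, token counts, and the one-third retention ratio are all hypothetical placeholders chosen for illustration, and they are not the Qwen3-VL configuration. It only shows that, for a standard decoder cache, memory grows linearly with the number of cached tokens, so keeping fewer visual tokens shrinks the cache almost proportionally when visual tokens dominate the sequence.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V;
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder configuration (assumed, NOT the Qwen3-VL config).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical prompt: a short text query plus a high-resolution image.
text_tokens = 64
visual_tokens = 4096
kept_visual = visual_tokens // 3  # assumed retention ratio, for illustration only

baseline = kv_cache_bytes(text_tokens + visual_tokens, **cfg)
compact = kv_cache_bytes(text_tokens + kept_visual, **cfg)
ratio = compact / baseline

print(f"baseline cache: {baseline / 2**20:.1f} MiB")
print(f"compact cache:  {compact / 2**20:.1f} MiB")
print(f"compact / baseline: {ratio:.3f}")
```

Because text tokens are left untouched in this sketch, the ratio sits slightly above the visual retention ratio; the gap narrows as the image contributes a larger share of the sequence.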