CV | LG | Apr 1, 2025

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

arXiv:2504.00557v1 | 11 citations | h-index: 5
Originality: Highly original
AI Analysis

This work addresses a compute bottleneck in cross-attention-based LVLMs, offering a practical efficiency improvement for deployment.

The paper tackles the high inference cost from extensive image features in cross-attention-based vision-language models by exploiting sparse cross-attention maps to prune redundant visual features, reducing KV cache demands by 50% without extra training while maintaining benchmark performance.
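To make the 50% figure concrete, the arithmetic behind a KV cache estimate can be sketched as follows. This is a rough illustration, not the paper's accounting: the function name `kv_cache_bytes` and every number in the example configuration (token count, layer count, head count, head dimension, 2-byte values) are hypothetical and chosen only to show how cache size scales with the number of cached visual features.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens across num_layers."""
    # Factor of 2: one cached tensor for keys and one for values.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, for illustration only (not taken from the paper).
config = dict(num_layers=8, num_kv_heads=8, head_dim=128)

full_bytes = kv_cache_bytes(num_tokens=4096, **config)
trimmed_bytes = kv_cache_bytes(num_tokens=2048, **config)  # half the visual features kept

print(f"full:    {full_bytes / 2**20:.0f} MiB")
print(f"trimmed: {trimmed_bytes / 2**20:.0f} MiB")
```

Because the cache size is linear in the number of cached tokens, keeping half of the visual features halves the cross-attention cache under these assumptions.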

Visual token reduction lowers the inference cost caused by the extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while achieving benchmark parity.
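The pruning idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the selection criterion, so the following rests on assumptions: importance is taken to be the total cross-attention weight each visual feature receives from the text queries (already averaged over heads), and the top fraction is kept. The names `select_visual_features` and `prune_kv` are hypothetical and are not the paper's API.

```python
def select_visual_features(attn, keep_ratio=0.5):
    """Return sorted indices of the visual features to keep.

    attn[q][v] is the cross-attention weight from text query q to visual
    feature v (assumed here to be already averaged over attention heads).
    """
    num_visual = len(attn[0])
    # Assumed importance score: total attention a visual feature receives.
    scores = [sum(row[v] for row in attn) for v in range(num_visual)]
    k = max(1, int(num_visual * keep_ratio))
    top = sorted(range(num_visual), key=lambda v: scores[v], reverse=True)[:k]
    return sorted(top)  # restore original feature order


def prune_kv(keys, values, keep):
    """Drop cached key/value entries whose index is not in keep."""
    return [keys[i] for i in keep], [values[i] for i in keep]


# Toy example: 2 text queries attending to 4 visual features.
attn = [
    [0.70, 0.05, 0.20, 0.05],
    [0.10, 0.05, 0.80, 0.05],
]
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]

keep = select_visual_features(attn, keep_ratio=0.5)
pruned_keys, pruned_values = prune_kv(keys, values, keep)
print(keep, pruned_keys, pruned_values)
```

Since the selection only reads attention weights that the model already computes, a scheme of this kind needs no additional training, which matches the training-free claim in the abstract.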
