CVFeb 4, 2025

D-Attn: Decomposed Attention for Large Vision-and-Language Models

Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen

arXiv:2502.01906v26.21 citationsh-index: 17Has Code

Originality Highly original

AI Analysis

This addresses efficiency and performance bottlenecks in LVLMs for researchers and practitioners, offering a novel architectural improvement.

The paper tackles the constrained architecture of large vision-and-language models (LVLMs) by proposing Decomposed Attention (D-Attn), which enables flexible visual token processing without affecting textual attention, resulting in significant performance improvements on image benchmarks and a 5x reduction in computational costs.

Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived weighting strategy, namely $α$-weighting. Taking advantage of the flexibility, we are able to introduce two critical improvements in visual token processing while maintaining the capacity of pre-trained LLMs: 1) We rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance. 2) We diagonalize visual-to-visual attention to reduce computation complexity from $O(|V|^2)$ to $O(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs (\eg, $5\times$ faster). Code will be available at https://github.com/bytedance/DecomposedAttention.

View on arXiv PDF Code

Similar