CV | Feb 4, 2025

D-Attn: Decomposed Attention for Large Vision-and-Language Models

arXiv:2502.01906v2 | 1 citation | h-index: 17 | Has Code
Originality: Highly original
AI Analysis

D-Attn addresses efficiency and performance bottlenecks in LVLMs, offering researchers and practitioners a novel architectural improvement.

The paper tackles the constrained architecture of large vision-and-language models (LVLMs) by proposing Decomposed Attention (D-Attn), which allows visual token processing to be modified without affecting textual-to-textual attention. The authors report significant performance improvements on image benchmarks and substantially lower computational cost (e.g., 5x faster).

Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs that enables modification of visual token operations without affecting textual-to-textual attention. D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions; the visual and textual output tokens from the decomposed attentions are then merged with a carefully derived weighting strategy, namely $\alpha$-weighting. Taking advantage of this flexibility, we introduce two critical improvements in visual token processing while maintaining the capacity of pre-trained LLMs: 1) We rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance. 2) We diagonalize visual-to-visual attention to reduce computational complexity from $O(|V|^2)$ to $O(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while substantially reducing computational costs (e.g., $5\times$ faster). Code will be available at https://github.com/bytedance/DecomposedAttention.
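
The decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not spell out the form of $\alpha$-weighting, so the code below assumes $\alpha$ is the share of softmax mass that a textual query places on visual keys. Under that assumption the merged output is exactly equal to ordinary softmax attention over the concatenated visual and textual keys. All function names are hypothetical, and scaling by $\sqrt{d}$, multi-head structure, and positional encodings are omitted for brevity.

```python
import math


def softmax_attend(q, keys, values):
    """Single-query softmax attention.

    Returns (output, normalizer), where normalizer = sum_j exp(q . k_j).
    No max-subtraction is applied, so this is only suitable for small scores.
    """
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = [math.exp(s) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, z


def merged_text_output(q, vis_k, vis_v, txt_k, txt_v):
    """Merge textual-to-visual and textual-to-textual outputs for one text query.

    txt_k / txt_v are assumed to hold only the textual positions the query
    may attend to (the causal mask is applied by the caller).
    """
    out_v, z_v = softmax_attend(q, vis_k, vis_v)  # textual-to-visual
    out_t, z_t = softmax_attend(q, txt_k, txt_v)  # textual-to-textual
    # ASSUMPTION: alpha is the fraction of softmax mass on the visual keys.
    alpha = z_v / (z_v + z_t)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(out_v, out_t)]


def diagonal_v2v(vis_v):
    """Diagonalized visual-to-visual attention.

    Each visual token attends only to itself, so the softmax over its single
    score is 1 and the output is the token's own value vector: O(|V|) work.
    """
    return [list(v) for v in vis_v]


if __name__ == "__main__":
    q = [0.3, -0.2]
    vis_k = [[0.1, 0.4], [0.5, -0.3], [-0.2, 0.2]]
    vis_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    txt_k = [[0.2, 0.2], [-0.1, 0.6]]
    txt_v = [[2.0, -1.0], [0.5, 0.5]]

    merged = merged_text_output(q, vis_k, vis_v, txt_k, txt_v)
    full, _ = softmax_attend(q, vis_k + txt_k, vis_v + txt_v)
    # The decomposed-and-merged output matches joint attention.
    print(all(abs(a - b) < 1e-9 for a, b in zip(merged, full)))
```

Because the two attention branches are computed separately, the visual branch can be changed (for example, given a different positional encoding or replaced with the diagonal form) without touching the textual-to-textual computation, which is the flexibility the abstract emphasizes.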

