LG · Jun 3, 2025

QKV Projections Require a Fraction of Their Memory

arXiv:2506.02939v2 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses memory efficiency in large language model training; the advance is incremental because it builds on existing attention mechanisms.

The paper tackles the memory consumption of the QKV projections in attention layers by proposing Point-Approximate Matrix Multiplication (PAMM), which reduces the memory usage of those projections by up to 512x while achieving similar or better final perplexity.

The Multi-Head Attention mechanism is central to LLM operation, and multiple works target its compute and memory efficiency during training. While most works focus on approximating the scaled dot product, the memory consumption of the linear projections that compute the $Q$, $K$, and $V$ tensors from the input $x$ is often overlooked. To address this, we propose Point-Approximate Matrix Multiplication (PAMM), a novel tensor compression technique that reduces the memory consumption of the $Q$, $K$, and $V$ projections in attention layers by a factor of up to $512$, effectively erasing their memory footprint, while achieving similar or better final perplexity. PAMM is fully composable with efficient attention techniques such as FlashAttention, making it a practical and complementary method for memory-efficient LLM training.
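To give a feel for the scale of a 512x reduction, here is a rough back-of-the-envelope sketch in Python. It does not implement PAMM. It assumes that the memory in question is the copy of the input $x$ that a standard linear layer keeps for its backward pass, and the batch size, sequence length, model width, and 2-byte element size are hypothetical values chosen only for illustration.

```python
def saved_input_bytes(batch, seq_len, d_model, bytes_per_elem=2):
    """Bytes needed to keep x of shape (batch, seq_len, d_model) in memory.

    Assumption: the projection input x is stored once per attention layer
    for the backward pass, at bytes_per_elem bytes per element.
    """
    return batch * seq_len * d_model * bytes_per_elem


def compressed_bytes(full_bytes, factor=512):
    """Size after an idealized compression by the given factor."""
    return full_bytes / factor


# Hypothetical training configuration (not taken from the paper).
full = saved_input_bytes(batch=8, seq_len=4096, d_model=4096)
small = compressed_bytes(full, factor=512)

MIB = 2 ** 20
print(f"stored x per layer: {full / MIB:.0f} MiB")
print(f"after 512x compression: {small / MIB:.2f} MiB")
```

Under these assumed numbers the stored input drops from 256 MiB to 0.5 MiB per layer, which is the sense in which the footprint is "effectively erased". Real savings depend on the model, the data type, and any bookkeeping the compression scheme itself requires.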
