LG · AI · Oct 27, 2025

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers

arXiv:2510.23912v2 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses a fundamental architectural inefficiency in large language models, potentially enabling more parameter-efficient designs, though it is incremental as it builds on existing Transformer theory.

The paper asks whether the Query, Key, Value weight triplet in decoder-only Transformers can be reduced. It proves, under simplifying assumptions, that the Query weights are redundant, and validates this with a GPT-3 small model trained from scratch that reaches comparable validation loss while cutting non-embedding/lm-head parameters by over 8%.
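One way to build intuition for the redundancy claim is the simplest possible case: a single attention head with square weight matrices, no layer normalization, no scaling, and no causal mask. In that toy setting the pre-softmax scores depend only on the product of the Query and Key weights, so the Query matrix can be folded into the Key matrix. The sketch below checks this numerically. It is an illustration under those stated assumptions, not the paper's proof; the names `W_q`, `W_k`, `W_k_merged` and the toy sizes are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d = 4, 3  # hypothetical toy sizes, chosen only for illustration

X = rand_matrix(n_tokens, d, rng)  # token representations, one row per token
W_q = rand_matrix(d, d, rng)       # Query weights (single head, square)
W_k = rand_matrix(d, d, rng)       # Key weights (single head, square)

# Standard pre-softmax scores: (X W_q)(X W_k)^T
scores_qk = matmul(matmul(X, W_q), transpose(matmul(X, W_k)))

# Fold W_q into the Key weights and use X itself as the queries:
# X (X W_k W_q^T)^T = X W_q W_k^T X^T, which is the same matrix.
W_k_merged = matmul(W_k, transpose(W_q))
scores_k_only = matmul(X, transpose(matmul(X, W_k_merged)))

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(scores_qk, scores_k_only)
    for a, b in zip(row_a, row_b)
)
print(f"max absolute difference: {max_diff:.2e}")
```

The two score matrices agree up to floating-point rounding. With multiple heads, layer normalization, and skip connections this simple fold no longer applies directly, which is presumably why the paper needs simplifying assumptions for its proof and an empirical check on the full architecture.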

The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
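The "over 8%" figure is consistent with a back-of-the-envelope parameter count. The sketch below assumes a standard GPT-style block with four d_model × d_model attention matrices (Query, Key, Value, output projection) and an MLP with 4× expansion, and it ignores biases and layer-normalization parameters. The GPT-3 small configuration used here (12 layers, d_model = 768) is the commonly cited one, and the function names are hypothetical. The paper's own accounting may differ in these details.

```python
def attention_params(d_model, with_query=True):
    """Weights in one attention block: Q, K, V, and output projection."""
    n_matrices = 4 if with_query else 3
    return n_matrices * d_model * d_model


def mlp_params(d_model, expansion=4):
    """Weights in one MLP block: up-projection plus down-projection."""
    return 2 * expansion * d_model * d_model


def block_params(d_model, with_query=True):
    """Weights in one Transformer block (biases and layer norm ignored)."""
    return attention_params(d_model, with_query) + mlp_params(d_model)


# Assumed GPT-3 small configuration.
n_layers, d_model = 12, 768

baseline = n_layers * block_params(d_model, with_query=True)
reduced = n_layers * block_params(d_model, with_query=False)
saving = 1 - reduced / baseline

print(f"baseline non-embedding weights: {baseline:,}")
print(f"without Query weights:          {reduced:,}")
print(f"relative saving:                {saving:.2%}")
```

Under these assumptions each block holds 12 · d_model² weights, one of the twelve d_model² units is the Query matrix, and the saving is 1/12 ≈ 8.33% regardless of depth or width.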

