Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
This work addresses an architectural inefficiency in large language models, namely the possibly redundant Query weights in attention, and could enable more parameter-efficient designs. It is incremental in that it builds on existing Transformer theory.
The paper asks whether the Query, Key, Value weight triplet in decoder-only Transformers can be reduced. It proves, under simplifying assumptions, that the Query weights are redundant, and validates this with a GPT-3 small model trained from scratch that achieves validation loss comparable to standard baselines while reducing non-embedding/lm-head parameters by over 8%.
The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
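The "over 8%" figure can be sanity-checked with a back-of-envelope count. The sketch below is not taken from the paper: it assumes a standard decoder block with four d_model x d_model attention matrices (W_Q, W_K, W_V, W_O) and an MLP with hidden width 4 * d_model, ignores biases and layer normalization, and assumes a GPT-3 small-like shape of 12 layers with d_model = 768. Under those assumptions, dropping W_Q removes one of twelve d_model^2 blocks per layer, i.e. 1/12 of the weights. The second part is a toy single-head check (with no layer normalization or skip connections) that the attention scores are unchanged when W_Q is replaced by the identity and absorbed into the Key weights; it illustrates the flavor of the redundancy claim and is not the paper's proof. All function and variable names are hypothetical.

```python
import random


def block_params(d_model, d_ff=None, with_query=True):
    """Weight-matrix parameters in one decoder block.

    Assumption: biases and LayerNorm parameters are ignored.
    """
    d_ff = 4 * d_model if d_ff is None else d_ff
    # W_Q, W_K, W_V, W_O (4 matrices) versus W_K, W_V, W_O (3 matrices).
    n_attn_mats = 4 if with_query else 3
    attn = n_attn_mats * d_model * d_model
    mlp = 2 * d_model * d_ff  # up-projection and down-projection
    return attn + mlp


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Part 1: parameter count for an assumed GPT-3 small-like shape.
n_layers, d_model = 12, 768
full = n_layers * block_params(d_model)
reduced = n_layers * block_params(d_model, with_query=False)
saving = 1 - reduced / full
print(f"{full:,} -> {reduced:,} parameters ({saving:.2%} fewer)")

# Part 2: toy single-head check that W_Q can be absorbed into W_K.
# scores = (X W_Q)(X W_K)^T = X (X W_K W_Q^T)^T
rng = random.Random(0)
n_tokens, d = 5, 4
X = rand_matrix(n_tokens, d, rng)
W_q = rand_matrix(d, d, rng)
W_k = rand_matrix(d, d, rng)

scores_qk = matmul(matmul(X, W_q), transpose(matmul(X, W_k)))
W_k_merged = matmul(W_k, transpose(W_q))  # Key weights with W_Q folded in
scores_k_only = matmul(X, transpose(matmul(X, W_k_merged)))

max_diff = max(abs(a - b)
               for row_a, row_b in zip(scores_qk, scores_k_only)
               for a, b in zip(row_a, row_b))
print(f"max score difference: {max_diff:.2e}")
```

The toy check covers only the attention scores of one full-width head in isolation. Multi-head attention, layer normalization, skip connections, and weight decay are exactly the complications that the abstract says are handled empirically rather than by the theory.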