LG · Apr 18, 2024

KV-weights are all you need for skipless transformers

arXiv:2404.12362v2 · 2 citations · h-index: 2 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work provides a practical optimization for popular LLMs that use MQA or GQA, such as Llama 2 and Mistral, reducing their compute and memory complexity. It is incremental, however, because it extends an existing method.

A previous skipless transformer design works only with multi-head attention (MHA). The paper addresses this limitation by proposing mathematically equivalent versions for multi-query attention (MQA) and grouped-query attention (GQA); applied to a skipless version of Mistral-7B, removing the Q and P layers would eliminate about 15% of its weights.
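The 15% figure can be sanity-checked by counting weights. The sketch below is only an illustration: it assumes the publicly documented Mistral-7B hyperparameters and counts the query (Q) and post-attention projection (P) matrices against the total of the standard, non-skipless model. The function name `mistral_7b_weight_counts` is hypothetical and not taken from the paper or its repository.

```python
# Hypothetical helper: counts weights from Mistral-7B's published hyperparameters.
def mistral_7b_weight_counts():
    dim = 4096          # embedding dimension
    n_layers = 32       # transformer blocks
    n_heads = 32        # query heads
    n_kv_heads = 8      # GQA: 8 key/value heads shared by the 32 query heads
    hidden_dim = 14336  # FFN hidden dimension
    vocab_size = 32000

    head_dim = dim // n_heads
    kv_dim = n_kv_heads * head_dim  # width of the K and V projections

    q = dim * dim                   # query projection (square)
    k = dim * kv_dim                # key projection (narrower under GQA)
    v = dim * kv_dim                # value projection (narrower under GQA)
    p = dim * dim                   # post-attention projection (square)
    ffn = 3 * dim * hidden_dim      # gate, up, and down projections
    norms = 2 * dim                 # two normalization weight vectors per block

    per_layer = q + k + v + p + ffn + norms
    # Input embedding, output head, and the final normalization vector.
    total = n_layers * per_layer + 2 * vocab_size * dim + dim
    removed = n_layers * (q + p)
    return total, removed


total, removed = mistral_7b_weight_counts()
print(f"total weights:    {total:,}")
print(f"Q and P weights:  {removed:,}")
print(f"fraction removed: {removed / total:.1%}")
```

Under these assumptions, Q and P account for a little under 15% of all weights, which is consistent with the rounded figure quoted above. Note that under GQA the K and V matrices are much smaller than Q and P, so dropping Q and P saves far more than dropping V and P would.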

He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme applies only to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter two schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). Watch our explainer video at https://youtu.be/Tx_lMpphd2g and see https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
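To illustrate what "mathematically equivalent" can mean here, the following NumPy sketch folds the Q projection into the preceding linear layer and compensates in K and V. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes there is no skip connection and no normalization between the preceding linear layer (called `O_prev` here) and the Q, K, V projections, and that the square matrix `Wq` is invertible. All variable names and the toy dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration): GQA with 4 query heads and 2 KV heads.
d, n_heads, n_kv_heads, seq, d_prev = 8, 4, 2, 5, 16
head_dim = d // n_heads
kv_dim = n_kv_heads * head_dim  # K and V are narrower than Q under GQA

u = rng.standard_normal((seq, d_prev))     # input to the preceding linear layer
O_prev = rng.standard_normal((d_prev, d))  # last linear layer of the previous block
# Kept close to the identity so the toy matrix is safely invertible.
Wq = np.eye(d) + 0.1 * rng.standard_normal((d, d))
Wk = rng.standard_normal((d, kv_dim))
Wv = rng.standard_normal((d, kv_dim))

# Original skipless path: x feeds Q, K, and V directly.
x = u @ O_prev
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Merged path: fold Wq into O_prev, then undo it inside K and V.
Wq_inv = np.linalg.inv(Wq)
O_merged = O_prev @ Wq     # same shape as O_prev
Wk_merged = Wq_inv @ Wk    # same shape as Wk
Wv_merged = Wq_inv @ Wv    # same shape as Wv

q2 = u @ O_merged          # queries come out directly, no Wq needed
k2, v2 = q2 @ Wk_merged, q2 @ Wv_merged

print("queries match:", np.allclose(q, q2))
print("keys match:   ", np.allclose(k, k2))
print("values match: ", np.allclose(v, v2))
```

The merged matrices have the same shapes as the ones they replace, so the d × d weights of `Wq` disappear entirely. The K and V matrices could not play this role under MQA or GQA because they are not square and therefore not invertible, which is presumably why Q is the layer to remove in that setting.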

Code Implementations: 1 repo