Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA
MoAS targets the inference-efficiency bottleneck of attention in large language models with a learned, per-token routing approach. The contribution is incremental, since it builds on existing attention schemes (MHA, GQA, and MQA).
The paper tackles the trade-off between modeling quality and inference efficiency in Transformer attention mechanisms. It proposes Mixture of Attention Schemes (MoAS), which dynamically routes each token between MHA, GQA, and MQA. On WikiText-2, dynamic routing reaches a validation loss of 2.3074, compared with 2.3093 for a static mixture.
The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory requirements during inference. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage but often at the cost of model performance. In this work, we propose Mixture of Attention Schemes (MoAS), a novel architecture that dynamically selects the optimal attention scheme (MHA, GQA, or MQA) for each token via a learned router. We demonstrate that dynamic routing performs better than static averaging of schemes and achieves performance competitive with the MHA baseline while offering potential for conditional compute efficiency. Experimental results on WikiText-2 show that dynamic routing (val loss 2.3074) outperforms a static mixture (2.3093), validating the effectiveness of the proposed method. Our code is available at https://github.com/Esmail-ibraheem/Mixture-of-Attention-Schemes-MoAS.
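To make the two ideas in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The first part shows why the three schemes differ in KV cache size: the cache grows with the number of key/value heads, which is the number of query heads for MHA, the number of groups for GQA, and one for MQA. The second part contrasts a dynamic mixture, whose weights depend on the token, with a static mixture that uses the same weights for every token. The head counts, the GQA group count, the use of soft (softmax-weighted) routing, and all function names are assumptions made for illustration, since the abstract does not specify them. The attention branches are not implemented; their per-token outputs are passed in as plain vectors.

```python
import math

# Hypothetical configuration, chosen only for illustration (not from the paper).
N_HEADS = 8      # number of query heads
N_GROUPS = 2     # number of KV heads in the GQA branch (assumed)
HEAD_DIM = 64    # dimension of each head

# Number of key/value heads that each scheme has to keep in the cache.
KV_HEADS = {"MHA": N_HEADS, "GQA": N_GROUPS, "MQA": 1}


def kv_cache_bytes(scheme, seq_len, n_layers=1, bytes_per_value=2):
    """KV cache size for one sequence.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * n_layers * KV_HEADS[scheme] * HEAD_DIM * seq_len * bytes_per_value


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(weights, branch_outputs):
    """Weighted sum of the per-scheme output vectors for a single token."""
    dim = len(branch_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, branch_outputs))
        for i in range(dim)
    ]


def dynamic_mixture(router_logits, branch_outputs):
    """Token-dependent weights from a router (soft routing is an assumption)."""
    return mix(softmax(router_logits), branch_outputs)


def static_mixture(branch_outputs):
    """Token-independent weights: a plain average of the schemes."""
    n = len(branch_outputs)
    return mix([1.0 / n] * n, branch_outputs)
```

With these assumed settings, `kv_cache_bytes("MHA", 1024)` is eight times `kv_cache_bytes("MQA", 1024)`, which is the memory gap the abstract refers to. Note that a soft mixture evaluates all three branches for every token, so any compute or memory saving would require hard selection of a single branch at inference time (for example, taking the arg-max of the router weights). This is one possible reading of the "potential for conditional compute efficiency" mentioned in the abstract, not something the abstract confirms.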