LG Jun 21, 2024

Optimised Grouped-Query Attention Mechanism for Transformers

Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

arXiv:2406.14963v1 14.213 citations

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in transformer efficiency for large language models, offering an incremental improvement over existing GQA methods.

The paper tackles the trade-off between model performance and hardware efficiency in grouped-query attention (GQA) for transformers. It proposes AsymGQA, an activation-informed method that groups attention heads asymmetrically rather than evenly. Within the same model size budget, AsymGQA LLaMA-2-7B improves MMLU accuracy by 7.5% over neighbour grouping, the standard way of converting MHA to GQA.

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
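To make the grouping step concrete, here is a minimal sketch of converting per-head key (or value) weights into shared group weights. It is an illustration under stated assumptions, not the authors' implementation: the merge rule (mean-pooling the heads in a group), the toy weights, and the helper names `neighbour_groups` and `merge_heads` are all hypothetical. The uneven grouping is hand-picked only to show that group sizes may differ; it does not reproduce the activation-informed search that AsymGQA uses to choose groups.

```python
from typing import List, Sequence

Vector = List[float]


def neighbour_groups(num_heads: int, num_groups: int) -> List[List[int]]:
    """Evenly split head indices 0..num_heads-1 into contiguous groups."""
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def merge_heads(
    head_weights: Sequence[Vector], groups: Sequence[Sequence[int]]
) -> List[Vector]:
    """Mean-pool per-head weights within each group (assumed merge rule)."""
    merged = []
    for group in groups:
        dim = len(head_weights[group[0]])
        merged.append(
            [sum(head_weights[h][i] for h in group) / len(group) for i in range(dim)]
        )
    return merged


# Toy example: 4 key heads, each flattened to 2 weights (made-up values).
key_heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Neighbour grouping: adjacent heads, equal group sizes.
even = neighbour_groups(num_heads=4, num_groups=2)
shared_even = merge_heads(key_heads, even)

# Asymmetric grouping: same number of groups, unequal sizes (hand-picked here).
uneven = [[0, 1, 2], [3]]
shared_uneven = merge_heads(key_heads, uneven)

print(even, shared_even)
print(uneven, shared_uneven)
```

Both groupings leave two shared key heads, so the parameter budget is identical; only the assignment of original heads to groups changes.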
