LG AI IVFeb 3, 2025

Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen

arXiv:2502.01770v17.12 citationsh-index: 5

Originality Highly original

AI Analysis

This addresses the efficiency problem for deploying long-context Transformers in real-world applications, offering a novel method that is incremental in improving upon prior binarization techniques.

The paper tackles the high computational and memory costs of running pre-trained transformer models with extended context windows by introducing Hamming Attention Distillation (HAD), which binarizes keys and queries to use efficient Hamming distance computations and incorporates attention sparsification, achieving state-of-the-art performance among binarized Transformers with minimal accuracy losses (e.g., 1.78% on GLUE vs. 9.08% prior) and significant hardware efficiency gains (79% area and 87% power reduction).

Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. \par Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference. \par We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just $\mathbf{1.78}\%$ performance losses on GLUE compared to $9.08\%$ in state-of-the-art binarization work, and $\mathbf{2.5}\%$ performance losses on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.

View on arXiv PDF

Similar