Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
This work addresses the efficiency problem of deploying long-context Transformers in real-world applications. The method is novel, though it is an incremental improvement over prior binarization techniques.
The paper tackles the high computational and memory costs of running pre-trained transformer models with extended context windows. It introduces Hamming Attention Distillation (HAD), which binarizes keys and queries so that attention scores can be computed with efficient Hamming distance operations, and adds attention sparsification on top. HAD achieves state-of-the-art performance among binarized Transformers with minimal accuracy losses (e.g., 1.78% on GLUE vs. 9.08% prior) and significant hardware efficiency gains (79% area and 87% power reduction).
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, and their high computational and memory requirements often limit real-world deployment. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into $\{-1, +1\}$ vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. \par Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference. \par We implement HAD in custom hardware simulations and compare it against a custom hardware implementation of standard attention. HAD incurs a performance loss of just $\mathbf{1.78}\%$ on GLUE, compared to $9.08\%$ for state-of-the-art binarization work, and $\mathbf{2.5}\%$ on ImageNet, compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction relative to its standard attention counterpart.
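To make the core idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It illustrates why a dot product between two $\{-1, +1\}$ vectors of dimension $d$ can be replaced by a Hamming distance $H$: agreeing positions contribute $+1$ and disagreeing positions contribute $-1$, so $q \cdot k = d - 2H(q, k)$. The function names (`binarize`, `hamming_scores`, `topk_sparsify`), the sign-based binarization rule, and the top-k form of sparsification are all assumptions made for illustration; the abstract does not specify how HAD binarizes or which activations it prunes.

```python
import numpy as np

def binarize(x):
    # Assumed rule: sign binarization, with zeros mapped to +1.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def hamming_scores(q_bin, k_bin):
    # q_bin: (n_q, d), k_bin: (n_k, d), entries in {-1, +1}.
    q_bits = q_bin > 0
    k_bits = k_bin > 0
    # XOR marks positions where the two sign patterns differ;
    # summing over the feature axis gives the Hamming distance.
    ham = (q_bits[:, None, :] ^ k_bits[None, :, :]).sum(axis=-1)
    d = q_bin.shape[-1]
    # Identity for {-1, +1} vectors: q . k = d - 2 * Hamming(q, k).
    return d - 2 * ham

def topk_sparsify(scores, n_keep):
    # Hypothetical sparsification: keep the n_keep largest scores
    # per query and set the rest to -inf so softmax zeroes them out.
    idx = np.argpartition(scores, -n_keep, axis=-1)[:, -n_keep:]
    kept = np.take_along_axis(scores, idx, axis=-1).astype(np.float64)
    sparse = np.full(scores.shape, -np.inf)
    np.put_along_axis(sparse, idx, kept, axis=-1)
    return sparse

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = binarize(rng.standard_normal((4, 64)))   # 4 queries, d = 64
k = binarize(rng.standard_normal((6, 64)))   # 6 keys,    d = 64

scores = hamming_scores(q, k)
# Reference: ordinary integer dot product on the same binarized vectors.
reference = q.astype(np.int32) @ k.astype(np.int32).T
assert np.array_equal(scores, reference)

sparse = topk_sparsify(scores, n_keep=3)
weights = softmax(sparse)   # pruned positions receive weight 0
```

The Boolean XOR here stands in for the bitwise XOR and popcount that packed-bit software or dedicated hardware would use. This sketch makes no claim about the specific circuit evaluated in the paper.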