NE, CL, LG | Nov 6, 2019

Fast Transformer Decoding: One Write-Head is All You Need

arXiv:1911.02150v1 | 788 citations
Originality: Incremental advance
AI Analysis

This paper addresses a decoding bottleneck in real-time applications such as machine translation by making Transformer decoding more efficient, though it is an incremental improvement over existing methods.

The paper addresses slow incremental inference in Transformer decoders, which is caused by the memory-bandwidth cost of repeatedly loading the keys and values tensors. It proposes multi-query attention, which shares a single set of keys and values across all attention heads, yielding much faster decoding with only minor quality degradation.

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
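
The difference between the two attention variants during incremental decoding can be illustrated with a minimal pure-Python sketch. This is not the paper's code: the function names, the dimension letters (h heads, m cached positions, k key size, v value size), and the helper cache_floats are assumptions chosen for illustration, and the input/output projections and logit scaling of a full Transformer layer are left out. It only shows that multi-query attention keeps one key/value cache for all heads, whereas multi-head attention keeps one per head.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Attention for one query vector over m cached positions.

    query: [k], keys: [m][k], values: [m][v]; returns a vector of size v.
    """
    weights = softmax([dot(query, key) for key in keys])
    value_dim = len(values[0])
    return [sum(w * val[j] for w, val in zip(weights, values))
            for j in range(value_dim)]


def multi_head_step(queries, keys, values):
    """One decoding step of multi-head attention.

    queries: [h][k], keys: [h][m][k], values: [h][m][v].
    Every head reads its own key/value cache.
    """
    return [attend(q, K, V) for q, K, V in zip(queries, keys, values)]


def multi_query_step(queries, shared_keys, shared_values):
    """One decoding step of multi-query attention.

    queries: [h][k], shared_keys: [m][k], shared_values: [m][v].
    All heads read the same single key/value cache.
    """
    return [attend(q, shared_keys, shared_values) for q in queries]


def cache_floats(num_heads, seq_len, key_dim, value_dim, shared):
    """Number of floats held in the key/value cache of one layer
    (hypothetical helper for the size comparison only)."""
    kv_heads = 1 if shared else num_heads
    return kv_heads * seq_len * (key_dim + value_dim)
```

For example, under the illustrative sizes h = 8, m = 128, k = v = 64, cache_floats gives 131072 floats for the per-head cache and 16384 for the shared cache, a reduction by a factor of h. That cache is what must be reloaded from memory at every decoding step, which is the bandwidth cost the abstract refers to.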

Code Implementations: 4 repos