CL · May 2, 2020

Hard-Coded Gaussian Attention for Neural Machine Translation

arXiv:2005.00742v1 · 1025 citations
Originality: Incremental advance
AI Analysis

This work clarifies which Transformer components matter for translation quality, which may guide researchers toward simpler and more efficient model designs. It is an incremental advance because it extends earlier work that questioned the importance of learned attention.

The study tests how much learned attention matters in Transformers for neural machine translation. Replacing the learned self-attention heads with fixed Gaussian distributions has minimal impact on BLEU scores. Hard-coding cross attention as well lowers BLEU significantly, and adding a single learned cross attention head recovers much of that drop.

Recent work has questioned the importance of the Transformer's multi-headed attention for achieving high translation quality. We push further in this direction by developing a "hard-coded" attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.
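To make "fixed, input-agnostic Gaussian distributions" concrete, here is a minimal NumPy sketch of one hard-coded self-attention head. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `offset` and `sigma` parameters, and their default values are hypothetical choices made for this example. The only property taken from the abstract is that the attention weights depend on token positions alone and involve no learned parameters.

```python
import numpy as np


def gaussian_attention_weights(seq_len, offset=0, sigma=1.0, causal=False):
    """Return a (seq_len, seq_len) matrix of fixed attention weights.

    Row i is a Gaussian over key positions j, centered at i + offset and
    normalized to sum to 1. Nothing here depends on the token content.
    `offset` and `sigma` are illustrative assumptions, not values from the text.
    """
    positions = np.arange(seq_len)
    # diff[i, j] = j - (i + offset): distance of key j from the head's center.
    diff = positions[None, :] - (positions[:, None] + offset)
    scores = -0.5 * (diff / sigma) ** 2
    if causal:
        # Decoder-style mask: query i may not attend to future keys j > i.
        future = positions[None, :] > positions[:, None]
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def hard_coded_attention(values, offset=0, sigma=1.0, causal=False):
    """Mix value vectors of shape (seq_len, dim) using the fixed weights."""
    weights = gaussian_attention_weights(values.shape[0], offset, sigma, causal)
    return weights @ values
```

A learned head would compute its weights from query and key projections of the input. Here the weight matrix can be built once per sequence length and reused for every sentence, which is what makes the head parameter-free.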

Code Implementations: 1 repo