Mar 13, 2021

Approximating How Single Head Attention Learns

arXiv:2103.07601v3
36 citations
Originality: Incremental advance
AI Analysis

This work offers insight into attention mechanisms in neural networks, an incremental advance in understanding model interpretability and training dynamics in natural language processing.

The paper investigates how single-head attention learns to attend to salient words by approximating training as a two-stage process. It argues that knowledge of word translations drives attention learning and demonstrates that, without this knowledge, models fail even on simple tasks such as copying input words.

Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two-stage process: early on in training, when the attention weights are uniform, the model learns to translate individual input word `i` to `o` if they co-occur frequently. Later, the model learns to attend to `i` while the correct output is `o` because it knows `i` translates to `o`. To formalize, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g. knowing that `i` translates to `o`), and claim that it drives the learning of the attention. This claim is supported by the fact that before the attention mechanism is learned, KTIW can be learned from word co-occurrence statistics, but not the other way around. In particular, we can construct a training distribution that makes KTIW hard to learn; under it, the learning of the attention fails, and the model cannot even learn the simple task of copying the input words to the output. Our approximation explains why models sometimes attend to salient words, and inspires a toy example where a multi-head attention model can overcome the above hard training distribution by improving learning dynamics rather than expressiveness. We end by discussing the limitations of our approximation framework and suggesting future directions.
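The role of co-occurrence statistics in the first stage can be illustrated with a small sketch. This is not the paper's code: the function `cooccurrence_scores`, the three-word vocabulary, and both toy datasets are hypothetical, chosen only to show how a co-occurrence signal can be present in one copy-task distribution and absent in another. The "hard" dataset here, where every sentence is a permutation of the full vocabulary, is an assumed construction for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_scores(pairs):
    """Estimate P(o appears in the output | i appears in the input).

    `pairs` is a list of (input_words, output_words) sentence pairs.
    This is a crude stand-in for what a model with uniform attention
    could pick up from the data; it is an illustrative assumption,
    not the estimator used in the paper.
    """
    joint = Counter()
    input_count = Counter()
    for src, tgt in pairs:
        for i in set(src):
            input_count[i] += 1
            for o in set(tgt):
                joint[(i, o)] += 1
    return {(i, o): c / input_count[i] for (i, o), c in joint.items()}


# Copy task where sentences use different subsets of the vocabulary:
# each word co-occurs most often with itself.
easy = [(s, s) for s in (["a", "b"], ["b", "c"], ["a", "c"])]

# Copy task where every sentence is a permutation of the whole vocabulary:
# every input word co-occurs with every output word equally often.
hard = [(list(p), list(p)) for p in permutations(["a", "b", "c"])]

easy_scores = cooccurrence_scores(easy)
hard_scores = cooccurrence_scores(hard)

print("easy:", easy_scores[("a", "a")], easy_scores[("a", "b")])
print("hard:", hard_scores[("a", "a")], hard_scores[("a", "b")])
```

In the first dataset the score for the pair (`a`, `a`) is higher than for (`a`, `b`), so co-occurrence alone hints at the correct word-level translation. In the second dataset all scores are identical, so these statistics carry no information about which input word maps to which output word.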

Code Implementations: 1 repo