LG · NE · SD · AS · ML · Mar 5, 2020

Talking-Heads Attention

arXiv:2003.02436v1 · 102 citations
Originality: Incremental advance
AI Analysis

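This incremental improvement targets the quality of multi-head attention in transformer architectures for natural language processing, at the cost of a small number of additional parameters and a moderate amount of additional computation.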

The paper aims to improve multi-head attention in transformer models by introducing 'talking-heads attention', which adds linear projections across the attention heads immediately before and after the softmax. The reported result is better perplexity on masked language modeling and better quality when transfer-learning to language comprehension and question answering tasks.

We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
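To make the abstract's description concrete, here is a minimal, dependency-free sketch of the idea: attention logits are mixed across the heads dimension just before the softmax, and the resulting weights are mixed again just after it. This is an illustration under stated assumptions, not the authors' implementation. The names `mix_heads`, `proj_logits`, and `proj_weights` are hypothetical, the tensor layouts are chosen for readability, and the sketch lets the number of heads differ between stages. It also leaves out the input/output projections, logit scaling, and masking that a full transformer layer would include.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mix_heads(per_head, proj):
    """Linear projection across the heads dimension.

    per_head: [h_in][n][m] attention logits or weights
    proj:     [h_in][h_out] mixing matrix (learned in a real model)
    returns:  [h_out][n][m]
    """
    h_in, h_out = len(proj), len(proj[0])
    n, m = len(per_head[0]), len(per_head[0][0])
    return [
        [[sum(per_head[a][i][j] * proj[a][b] for a in range(h_in))
          for j in range(m)]
         for i in range(n)]
        for b in range(h_out)
    ]


def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """Hypothetical layout: q [h_k][n][d_k], k [h_k][m][d_k], v [h_v][m][d_v].

    proj_logits:  [h_k][h]  applied immediately before the softmax
    proj_weights: [h][h_v]  applied immediately after the softmax
    returns:      [h_v][n][d_v]
    """
    # Dot-product logits, one [n][m] matrix per query/key head.
    logits = [
        [[sum(a * b for a, b in zip(q_i, k_j)) for k_j in k_h] for q_i in q_h]
        for q_h, k_h in zip(q, k)
    ]
    # Heads exchange information before the softmax...
    logits = mix_heads(logits, proj_logits)
    weights = [[softmax(row) for row in head] for head in logits]
    # ...and again after it.
    weights = mix_heads(weights, proj_weights)
    # Weighted sum of values, one [n][d_v] matrix per value head.
    return [
        [[sum(w * v_j[c] for w, v_j in zip(w_i, v_h))
          for c in range(len(v_h[0]))]
         for w_i in w_h]
        for w_h, v_h in zip(weights, v)
    ]
```

When all head counts are equal and both mixing matrices are the identity, this sketch reduces to ordinary multi-head dot-product attention, which is a convenient sanity check. The only extra parameters are the two small head-mixing matrices, which fits the abstract's remark about a small parameter overhead.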

Code Implementations (4 repos)
