Talking-Heads Attention
This is an incremental improvement to the multi-head attention used in transformer architectures for natural language processing, trading a small increase in parameters and computation for better model quality.
The paper sets out to improve multi-head attention in transformer models. It introduces "talking-heads attention", which adds linear projections across the attention-heads dimension immediately before and after the softmax. The result is better perplexity on masked language modeling and better performance when transfer-learning to language comprehension and question answering tasks.
We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
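
The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function and argument names (`talking_heads_attention`, `proj_logits`, `proj_weights`), the unbatched tensor layout, and the conventional 1/sqrt(d_k) scaling are all assumptions made here for clarity. The only parts taken from the text above are the two linear projections across the heads dimension, one applied to the attention logits just before the softmax and one applied to the attention weights just after it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """Sketch of attention with head-mixing projections around the softmax.

    Shapes (all names are illustrative assumptions):
      q:            [heads_k, n_q,  d_k]
      k:            [heads_k, n_kv, d_k]
      v:            [heads_v, n_kv, d_v]
      proj_logits:  [heads_k, heads]    mixes heads before the softmax
      proj_weights: [heads,   heads_v]  mixes heads after the softmax
    Returns an array of shape [heads_v, n_q, d_v].
    """
    d_k = q.shape[-1]
    # Per-head dot-product logits (scaling is the usual convention, assumed here).
    logits = np.einsum("hqd,hkd->hqk", q, k) / np.sqrt(d_k)
    # Linear projection across the heads dimension, before the softmax.
    logits = np.einsum("hqk,hg->gqk", logits, proj_logits)
    # Softmax over the key/value positions, as in ordinary attention.
    weights = softmax(logits, axis=-1)
    # Linear projection across the heads dimension, after the softmax.
    weights = np.einsum("gqk,gj->jqk", weights, proj_weights)
    # Weighted sum of the values for each output head.
    return np.einsum("jqk,jkd->jqd", weights, v)
```

With both projections set to identity matrices, this sketch reduces to ordinary multi-head attention, which is a convenient sanity check. The added parameters are just the two small head-by-head matrices, consistent with the abstract's note that only a small number of parameters is introduced.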