LG NE SD AS ML
Mar 5, 2020

Talking-Heads Attention

Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou

arXiv:2003.02436v1 19.5102 citations Has Code

Originality: Incremental advance

AI Analysis

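This incremental improvement targets the quality of multi-head attention in transformer architectures for natural language processing tasks, at the cost of a small number of additional parameters and a moderate amount of additional computation.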

The paper aims to improve multi-head attention in transformer models. It introduces 'talking-heads attention', which adds linear projections across the attention-heads dimension immediately before and after the softmax. The reported results are better perplexities on masked language modeling and better quality when transfer-learning to language comprehension and question answering tasks.

We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
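
The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function name `talking_heads_attention`, the parameter names `proj_logits` and `proj_weights`, the tensor layouts, and the `1/sqrt(d_k)` scaling are all assumptions made for illustration. The sketch only shows where the two cross-head projections sit relative to the softmax.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def talking_heads_attention(q, k, v, proj_logits, proj_weights):
    """Illustrative sketch; shapes and names are assumptions.

    q: [h_k, n, d_k]        queries, one slice per query/key head
    k: [h_k, m, d_k]        keys
    v: [h_v, m, d_v]        values, one slice per value head
    proj_logits:  [h_k, h]  mixes heads immediately before the softmax
    proj_weights: [h, h_v]  mixes heads immediately after the softmax
    returns: [h_v, n, d_v]
    """
    d_k = q.shape[-1]
    # Per-head dot-product logits (scaling by sqrt(d_k) is an assumption).
    logits = np.einsum("hnd,hmd->hnm", q, k) / np.sqrt(d_k)
    # Linear projection across the heads dimension, before the softmax.
    logits = np.einsum("anm,ab->bnm", logits, proj_logits)
    # Softmax over the key positions, independently for each head.
    weights = softmax(logits, axis=-1)
    # Linear projection across the heads dimension, after the softmax.
    weights = np.einsum("bnm,bc->cnm", weights, proj_weights)
    # Weighted sum of the values for each value head.
    return np.einsum("cnm,cmd->cnd", weights, v)
```

With equal head counts and identity matrices for both projections, this sketch reduces to ordinary multi-head dot-product attention, which is a convenient sanity check. In this layout the added parameters are only the two small head-mixing matrices.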

