CL LG May 31, 2021

Cascaded Head-colliding Attention

arXiv:2105.14850v131.4711 citations Has Code

Originality: Incremental advance

AI Analysis

The paper addresses parameter inefficiency in Transformers for NLP tasks, but the advance is incremental because it builds on the existing MHA framework.

The paper tackles the problem of redundant attention heads in Transformers, which wastes model capacity, by proposing cascaded head-colliding attention (CODA) to improve parameter efficiency, resulting in a 0.6 perplexity reduction on Wikitext-103 and a 0.6 BLEU improvement on WMT14 EN-DE.

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. The cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, with the result that many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the Transformer baseline by 0.6 perplexity on Wikitext-103 in language modeling and by 0.6 BLEU on WMT14 EN-DE in machine translation, due to its improvements in parameter efficiency. Our implementation is publicly available at https://github.com/LZhengisme/CODA.
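For context, the following is a minimal NumPy sketch of the standard multi-head attention baseline that the abstract refers to. It is not CODA and is not taken from the authors' repository. The function name `multi_head_attention`, the weight names, and the toy dimensions are illustrative assumptions. The point it shows is that each head computes its own attention map from its own slice of the projections, with no term coupling one head to another until the final output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Standard (baseline) multi-head self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Returns the output (seq_len, d_model) and the per-head
    attention maps (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Each head scores its own queries against its own keys;
    # nothing here lets one head see what another attends to.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    heads = attn @ v  # (num_heads, seq_len, d_head)

    # Heads only meet when they are concatenated and projected.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, attn

# Toy usage with random weights (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, attn = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Because the heads are computed independently, nothing in this formulation discourages two heads from learning near-identical attention maps, which is the redundancy the paper targets.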

