LG · ML · Aug 30, 2019

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

arXiv:1908.11775v4 · 326 citations
Originality: Incremental advance
AI Analysis

This work provides a unified framework for analyzing and enhancing attention in Transformers. The advance is incremental, but it offers practical benefits for sequence learning tasks such as machine translation.

The paper tackles the challenge of understanding and improving the Transformer's attention mechanism by reformulating it through a kernel perspective, leading to a new attention variant that achieves competitive performance with less computation.

The Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel. More precisely, we show that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as a better way to integrate the positional embedding. Another important advantage of our kernel-based formulation is that it opens up a larger space of ways to compose the Transformer's attention. As an example, we propose a new variant of the Transformer's attention that models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model while using less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
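
The kernel-smoother reading of attention described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names (`softmax_attention`, `exp_kernel`, `kernel_smoother_attention`), the toy shapes, and the choice of an exponential kernel with 1/sqrt(d) scaling are assumptions made for illustration. Under those assumptions, a kernel-weighted average of the values gives the same result as standard scaled dot-product attention.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention (single head, no mask)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row-wise max for numerical stability; softmax is unchanged.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def exp_kernel(q, k):
    """Assumed kernel: exponential of the scaled dot product (non-negative)."""
    return np.exp(q @ k / np.sqrt(q.shape[-1]))

def kernel_smoother_attention(Q, K, V, kernel=exp_kernel):
    """Attention written as a kernel smoother: each output row is a
    kernel-weighted average of the value rows."""
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        scores = np.array([kernel(q, k) for k in K])  # similarity to each key
        out[i] = (scores / scores.sum()) @ V          # normalized weighted average
    return out

# Toy inputs (hypothetical sizes): 4 queries, 5 keys/values, feature dim 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 3))

same = np.allclose(softmax_attention(Q, K, V), kernel_smoother_attention(Q, K, V))
print(same)
```

Passing a different non-negative similarity function as `kernel` changes the attention variant without touching the averaging step, which is one way to picture the "larger space" of attention designs that the abstract mentions.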

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
