LG, NE, ML | Feb 12, 2020

GLU Variants Improve Transformer

arXiv:2002.05202v1 | 1905 citations
AI Analysis

This work targets quality improvements in Transformer models. It is incremental: rather than proposing a new architecture, it swaps variations of the existing GLU into the feed-forward sublayer of the existing Transformer.

The paper tested variations of Gated Linear Units (GLU) in Transformer feed-forward sublayers, finding that some variants improved model quality over standard activations like ReLU or GELU.

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
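As a rough illustration of the abstract's description, the sketch below builds a gated feed-forward sublayer in plain Python: two linear projections of the input, one passed through a gate function, multiplied component-wise and then projected back. It is a minimal sketch under stated assumptions, not the authors' implementation. The helper names (`matvec`, `ffn_glu`, `VARIANTS`) are hypothetical, biases are left out for brevity, and the variant labels follow the names used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z):
    # Swish with beta = 1.
    return z * sigmoid(z)


def identity(z):
    return z


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def ffn_glu(x, W, V, W2, gate_fn=sigmoid):
    """Gated feed-forward sublayer (hypothetical helper, biases omitted).

    Computes (gate_fn(xW) * xV) W2, where * is the component-wise product.
    """
    gate = [gate_fn(g) for g in matvec(x, W)]
    value = matvec(x, V)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(hidden, W2)


# Gate functions for the variants; only the gate nonlinearity changes.
VARIANTS = {
    "GLU": sigmoid,        # original sigmoid gate
    "Bilinear": identity,  # linear gate
    "ReGLU": relu,
    "GEGLU": gelu,
    "SwiGLU": swish,
}

if __name__ == "__main__":
    x = [1.0, -2.0]
    W = [[1.0, 0.0], [0.0, 1.0]]  # toy identity projections
    V = [[1.0, 0.0], [0.0, 1.0]]
    W2 = [[1.0], [1.0]]           # sums the hidden units
    for name, fn in VARIANTS.items():
        print(name, ffn_glu(x, W, V, W2, gate_fn=fn))
```

Because the gated layer uses two input projections (`W` and `V`) where a standard feed-forward layer uses one, a fair comparison needs a smaller hidden width to keep the parameter count roughly equal.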

Code Implementations: 27 repos

Data from Papers with Code (CC-BY-SA-4.0)

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
