CL · LG · Apr 13, 2021

Lessons on Parameter Sharing across Layers in Transformers

arXiv:2104.06022v4 · 242 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges in Transformer models for NLP practitioners, but it is incremental as it builds on prior parameter-sharing methods.

The paper relaxes all-layer parameter sharing, as used in Universal Transformers, by proposing three strategies (Sequence, Cycle, and Cycle (rev)) for assigning parameters to Transformer layers. The strategies are reported to be efficient in both parameter size and computational time, and to remain effective in data-rich configurations such as the WMT competition.

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of one layer across all layers, as in Universal Transformers (Dehghani et al., 2019), to improve efficiency in computational time. We propose three strategies for assigning parameters to each layer: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies are also effective in configurations with large amounts of training data, such as the recent WMT competition.
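The abstract names the three assignment strategies but does not spell out the mapping from layers to parameter sets. The sketch below is one plausible reading, not the authors' implementation. It assumes `n_params` distinct parameter sets are assigned to `n_layers` layers, that Sequence gives consecutive layers the same set, that Cycle repeats the sets in order, and that Cycle (rev) cycles first and then assigns the last `n_params` layers in reverse order. The function name `assign_layers`, its arguments, and the requirement that `n_layers` be divisible by `n_params` are all hypothetical simplifications.

```python
def assign_layers(n_layers: int, n_params: int, strategy: str) -> list[int]:
    """Return, for each layer, the index of the parameter set it uses.

    Hypothetical helper: the mapping rules are an assumed reading of the
    strategy names, not a definition taken from the paper's code.
    """
    if n_params < 1 or n_layers < n_params:
        raise ValueError("need 1 <= n_params <= n_layers")
    if n_layers % n_params != 0:
        # Assumption: keep the sketch simple by requiring an even split.
        raise ValueError("n_layers must be divisible by n_params")

    if strategy == "sequence":
        # Consecutive layers share one parameter set: 0,0,1,1,2,2
        group = n_layers // n_params
        return [i // group for i in range(n_layers)]
    if strategy == "cycle":
        # Parameter sets repeat in order: 0,1,2,0,1,2
        return [i % n_params for i in range(n_layers)]
    if strategy == "cycle_rev":
        # Cycle first, then the last n_params layers in reverse: 0,1,2,2,1,0
        head = n_layers - n_params
        tail = list(range(n_params - 1, -1, -1))
        return [i % n_params for i in range(head)] + tail
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("sequence", "cycle", "cycle_rev"):
        print(name, assign_layers(6, 3, name))
```

Under these assumptions, `n_params=1` makes every layer use the same parameter set, which corresponds to the all-layer sharing that the abstract says the method relaxes, while `n_params == n_layers` gives every layer its own set.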

Code Implementations: 2 repos