LG · CL · ML · Apr 17, 2020

Understanding the Difficulty of Training Transformers

arXiv:2004.08249v3 · 1095 citations · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a key bottleneck for NLP researchers and practitioners by providing a more stable and efficient training method for Transformers, though the advance is incremental because it builds on existing architectures.

The paper investigates the instability in training Transformers, identifying an amplification effect from heavy dependency on residual branches as the cause, and proposes Admin, an adaptive model initialization method that stabilizes training, converges faster, and improves performance in experiments.

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding designing cutting-edge optimizers and learning rate schedulers carefully (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize the early stage's training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.
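The amplification effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the Admin method and is not taken from the released code. It assumes a stack of post-LayerNorm layers of the form LN(omega * x + W x), where the skip weight `omega`, the linear residual branch `W`, and the helper names are all hypothetical choices made for illustration. A small `omega` means the layer output depends heavily on the residual branch; a large `omega` means it depends lightly on it. The function measures how far the final output moves when every weight receives the same small random perturbation.

```python
import math
import random


def layer_norm(v):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + 1e-6) for a in v]


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def post_ln_layer(x, w, omega):
    """Toy layer: LN(omega * x + W x). `omega` is a hypothetical skip weight."""
    branch = matvec(w, x)
    return layer_norm([omega * xi + bi for xi, bi in zip(x, branch)])


def forward(x, weights, omega):
    """Apply the toy layers in sequence."""
    for w in weights:
        x = post_ln_layer(x, w, omega)
    return x


def output_shift(omega, dim=16, depth=6, noise=1e-2, seed=0):
    """Euclidean distance between outputs before and after perturbing all weights.

    The same seed yields the same input, weights, and perturbation for every
    omega, so results for different omega values are directly comparable.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = 1.0 / math.sqrt(dim)
    weights = [
        [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(depth)
    ]
    # Stand-in for a parameter update: add small Gaussian noise to every weight.
    perturbed = [
        [[wij + rng.gauss(0.0, noise) for wij in row] for row in w]
        for w in weights
    ]
    clean = forward(x, weights, omega)
    moved = forward(x, perturbed, omega)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, moved)))


if __name__ == "__main__":
    for omega in (0.25, 1.0, 4.0):
        print(f"omega={omega}: output shift = {output_shift(omega):.4f}")
```

In this toy setting, comparing `output_shift(0.25)` with `output_shift(4.0)` shows the heavily branch-dependent stack reacting more strongly to the same perturbation. It says nothing about the other half of the abstract's claim, that a light dependency limits what the trained model can achieve.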

Code Implementations: 2 repos
