LG, CL, CV · Oct 12, 2022

Foundation Transformers

CMU, Microsoft
arXiv:2210.06423v2 · 30 citations · h-index: 102
Originality: Incremental advance
AI Analysis

This work addresses the fragmentation of Transformer implementations across research areas by proposing a single general-purpose architecture. The advance is incremental, since it builds on existing Transformer variants.

The paper tackles the lack of a unified Transformer architecture across modalities by introducing Magneto, a variant that combines Sub-LayerNorm with an initialization strategy derived from DeepNet. In the reported experiments, Magneto outperforms the de facto variants and trains more stably on language, vision, speech, and multimodal tasks.

A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
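To make the Post-LayerNorm / Pre-LayerNorm distinction in the abstract concrete, here is a minimal NumPy sketch of one feed-forward sublayer under each scheme. The helper names (`layer_norm`, `ffn`, `post_ln_block`, `pre_ln_block`, `sub_ln_block`) are hypothetical and chosen for illustration. The Sub-LayerNorm variant assumes the extra normalization sits just before the sublayer's output projection; that placement is not spelled out on this page, so treat it as an assumption to check against the paper. Learned scale and bias parameters and the attention sublayer are omitted for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ffn(x, w_in, w_out):
    """Two-layer feed-forward sublayer with a ReLU in between."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def post_ln_block(x, w_in, w_out):
    # Post-LayerNorm: normalize AFTER adding the residual.
    return layer_norm(x + ffn(x, w_in, w_out))


def pre_ln_block(x, w_in, w_out):
    # Pre-LayerNorm: normalize the sublayer INPUT, leave the residual path untouched.
    return x + ffn(layer_norm(x), w_in, w_out)


def sub_ln_block(x, w_in, w_out):
    # Assumed Sub-LayerNorm layout: Pre-LN plus a second normalization
    # placed right before the output projection (an assumption, see lead-in).
    hidden = np.maximum(layer_norm(x) @ w_in, 0.0)
    return x + layer_norm(hidden) @ w_out


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 tokens, model width 8
w_in = rng.normal(size=(8, 32)) * 0.1    # input projection
w_out = rng.normal(size=(32, 8)) * 0.1   # output projection

post_out = post_ln_block(x, w_in, w_out)
pre_out = pre_ln_block(x, w_in, w_out)
sub_out = sub_ln_block(x, w_in, w_out)
```

All three blocks map a `(tokens, width)` array to an array of the same shape; they differ only in where normalization is applied relative to the residual connection. The initialization strategy mentioned in the abstract is deliberately not sketched here, because its scaling constants are not given on this page.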

Code Implementations: 4 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
