LG · CL · ML · Sep 25, 2019

Reducing Transformer Depth on Demand with Structured Dropout

arXiv:1909.11556v1 · 707 citations
Originality: Highly original
AI Analysis

This paper addresses the need for efficient, high-quality transformer models in NLP applications by offering a practical way to reduce computational cost while maintaining performance.

The paper tackles the computational expense and overfitting of overparameterized transformers by introducing LayerDrop, a structured dropout method. LayerDrop allows sub-networks of any depth to be selected at inference time without fine-tuning, and it improves the state of the art on tasks such as machine translation and language modeling.

Overparameterized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout, which has a regularization effect during training and allows for efficient pruning at inference time. In particular, we show that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance. We demonstrate the effectiveness of our approach by improving the state of the art on machine translation, language modeling, summarization, question answering, and language understanding benchmarks. Moreover, we show that our approach leads to small BERT-like models of higher quality compared to training from scratch or using distillation.
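
The idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the names `forward`, `make_layer`, and `keep_every_nth` are hypothetical, the "layers" are toy scalar functions standing in for transformer blocks, and the fixed-stride rule for choosing which layers to keep is an assumption made for illustration. The sketch shows the two modes described above: whole layers are randomly skipped during training, and a shallower sub-network is selected at inference with no further training.

```python
import random


def make_layer(weight):
    """Toy stand-in for a transformer block: a scalar function of x (assumption)."""
    return lambda x: weight * x


def forward(x, layers, drop_prob=0.0, training=False, keep=None, rng=random):
    """Run x through a stack of residual layers.

    training=True: each layer is skipped independently with probability drop_prob.
    keep: optional set of layer indices retained at inference (the pruned sub-network).
    """
    for i, layer in enumerate(layers):
        if keep is not None and i not in keep:
            continue  # layer pruned at inference time
        if training and rng.random() < drop_prob:
            continue  # whole layer dropped for this forward pass
        # The residual connection means skipping a layer is just the identity.
        x = x + layer(x)
    return x


def keep_every_nth(num_layers, stride):
    """Hypothetical pruning rule: keep layers 0, stride, 2*stride, ..."""
    return set(range(0, num_layers, stride))


layers = [make_layer(1) for _ in range(6)]

full = forward(1, layers)                                  # all 6 layers
half = forward(1, layers, keep=keep_every_nth(6, 2))       # 3-layer sub-network
noisy = forward(1, layers, drop_prob=0.5, training=True,
                rng=random.Random(0))                      # random depth per pass
```

Because every layer sits behind a residual connection, removing one leaves a valid, shallower network. This is why a model trained with random layer skipping can plausibly tolerate depth reduction at inference.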

Code Implementations · 5 repos
