CL, LG · Apr 8, 2020

Analyzing Redundancy in Pretrained Transformer Models

arXiv:2004.04010v2 · 1033 citations
AI Analysis

This paper addresses efficiency issues in deploying large NLP models in resource-constrained environments. Its contribution is incremental, building on existing redundancy analysis and transfer learning approaches.

The paper tackles redundancy in pretrained transformer models such as BERT and XLNet, whose hundreds of millions of parameters limit their use in computationally constrained settings. It finds that 85% of neurons across the network are redundant and that at least 92% can be removed when optimizing for a downstream task. Building on this analysis, it presents a feature-based transfer learning procedure that maintains 97% of performance while using at most 10% of the original neurons.

Transformer-based deep NLP models are trained using hundreds of millions of parameters, limiting their applicability in computationally constrained environments. In this paper, we study the cause of these limitations by defining a notion of Redundancy, which we categorize into two classes: General Redundancy and Task-specific Redundancy. We dissect two popular pretrained models, BERT and XLNet, studying how much redundancy they exhibit at a representation-level and at a more fine-grained neuron-level. Our analysis reveals interesting insights, such as: i) 85% of the neurons across the network are redundant and ii) at least 92% of them can be removed when optimizing towards a downstream task. Based on our analysis, we present an efficient feature-based transfer learning procedure, which maintains 97% performance while using at most 10% of the original neurons.
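
To make the idea of neuron-level redundancy concrete, here is a minimal sketch of one simple way it could be measured: treat a neuron as redundant when its activations are strongly correlated with those of a neuron already kept. This is an illustration under stated assumptions, not the paper's procedure. The function name `select_nonredundant_neurons`, the 0.9 correlation threshold, and the synthetic activations are all hypothetical choices made for this example.

```python
import numpy as np


def select_nonredundant_neurons(activations, threshold=0.9):
    """Greedily keep neurons that are not strongly correlated with any kept neuron.

    activations: array of shape (num_tokens, num_neurons); assumes no neuron
    is constant, since a constant column has an undefined correlation.
    Returns (kept_indices, redundant_fraction).
    """
    # Absolute Pearson correlation between every pair of neuron columns.
    corr = np.abs(np.corrcoef(activations, rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        # Keep neuron j only if it is below the threshold against all kept neurons.
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    redundant_fraction = 1.0 - len(kept) / corr.shape[0]
    return kept, redundant_fraction


# Synthetic stand-in for model activations (an assumption for illustration):
# three independent signals, each duplicated with a small amount of noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 3))
copies = base + 0.01 * rng.normal(size=base.shape)
acts = np.hstack([base, copies])  # 6 "neurons", 3 of them near-duplicates

kept, frac = select_nonredundant_neurons(acts)
print(kept, frac)
```

In this toy setup the three near-duplicate columns are flagged as redundant. With real models, `acts` would be replaced by activations extracted from a layer of BERT or XLNet, and the threshold would need tuning.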

Code Implementations: 1 repo