When BERT Plays the Lottery, All Tickets Are Winning
This work addresses model compression for NLP practitioners, but it is incremental: it builds on existing pruning and lottery ticket hypothesis research without introducing a new paradigm.
The paper studies how far large Transformer-based models like BERT can be reduced, applying the lottery ticket hypothesis with both structured and magnitude pruning. It shows that fine-tuned BERT contains subnetworks that perform comparably to the full model, while similarly-sized subnetworks sampled from the rest of the model perform worse. With structured pruning, even the worst subnetworks remain highly trainable, which suggests that most pre-trained weights are potentially useful.
Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.
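The two pruning styles named in the abstract can be illustrated with a minimal sketch. This is not the paper's procedure: the function names, the toy weights, and the per-head importance scores are hypothetical, and how importance is actually estimated is left out. Magnitude pruning masks individual weights by absolute value, while structured pruning keeps or drops whole units such as self-attention heads. The second function also shows how same-sized "top", "bottom", and random selections could be drawn for comparison.

```python
import random

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def select_heads(importance, keep, mode, seed=0):
    """Pick `keep` head indices by hypothetical importance score.

    mode="top" keeps the highest-scoring heads, mode="bottom" the
    lowest-scoring ones, and mode="random" a seeded random sample.
    """
    n = len(importance)
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    if mode == "top":
        return sorted(ranked[:keep])
    if mode == "bottom":
        return sorted(ranked[n - keep:])
    if mode == "random":
        return sorted(random.Random(seed).sample(range(n), keep))
    raise ValueError(f"unknown mode: {mode}")

# Toy inputs (assumed values, for illustration only).
toy_weights = [0.5, -0.1, 0.03, -2.0]
toy_importance = [0.9, 0.1, 0.5, 0.3]

weight_mask = magnitude_mask(toy_weights, sparsity=0.5)
top_heads = select_heads(toy_importance, keep=2, mode="top")
bottom_heads = select_heads(toy_importance, keep=2, mode="bottom")
random_heads = select_heads(toy_importance, keep=2, mode="random")
```

In an experiment of the kind the abstract describes, each selection would then be fine-tuned and evaluated separately, so that the three subnetworks can be compared at equal size.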