CL · Oct 15, 2021

Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

arXiv:2110.08190v4645 citations
Originality: Incremental advance
AI Analysis

This paper addresses a counter-traditional issue in model pruning that is relevant to NLP practitioners, though the contribution appears incremental because it builds on existing pruning and distillation techniques.

The paper tackles overfitting in Transformer-based language models when pruning is applied during fine-tuning under the pretrain-and-finetune paradigm. It shows that reducing overfitting improves pruning performance, and its experiments on the GLUE benchmark show the method outperforming leading competitors across tasks.

Conventional wisdom in pruning Transformer-based language models is that pruning reduces the model expressiveness and thus is more likely to underfit rather than overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis, that is: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.
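
To make the two ingredients named in the abstract concrete, the following is a minimal, generic sketch of magnitude pruning and a temperature-scaled distillation loss. It is not the paper's algorithm: the progressive schedule and the error-bound properties are not described here and are left out. The function names (`magnitude_prune`, `softmax`, `distillation_loss`), the flat weight list, and the default temperature are all assumptions made for illustration.

```python
import math

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper for illustration; real models prune tensors, not lists.
    """
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices of the k smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the standard soft-target distillation loss, used here as an
    assumed stand-in for whatever loss the paper actually optimizes.
    """
    p = softmax(teacher_logits, temperature)  # teacher (dense model)
    q = softmax(student_logits, temperature)  # student (pruned model)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Example: prune half of a toy weight vector, then compare toy logits.
pruned = magnitude_prune([0.5, -0.1, 0.05, -2.0], sparsity=0.5)
loss = distillation_loss([1.0, 2.5, 0.3], [1.2, 2.0, 0.1])
```

In a fine-tuning loop, a loss of this kind would typically be added to the task loss so that the pruned student is pulled toward the dense teacher's output distribution rather than fitting the small fine-tuning set alone.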

