Revisiting Structured Dropout
This work addresses overfitting for practitioners who train large neural networks in NLP and computer vision, but the contribution is incremental, since it builds on existing structured dropout approaches.
The paper tackles overfitting in neural networks by revisiting structured dropout methods. It proposes ProbDropBlock, which drops contiguous blocks with a probability based on feature salience, and reports consistent performance improvements, e.g., increasing RoBERTa on MNLI by 0.22% and ResNet50 on ImageNet by 0.28%.
Large neural networks are often overparameterised and prone to overfitting. Dropout is a widely used regularization technique to combat overfitting and improve model generalization. However, unstructured Dropout is not always effective for specific network architectures, and this has led to the development of multiple structured Dropout approaches that improve model performance and, sometimes, reduce the computational resources required for inference. In this work, we revisit structured Dropout, comparing different Dropout approaches on natural language processing and computer vision tasks for multiple state-of-the-art networks. Additionally, we devise an approach to structured Dropout we call \textbf{\emph{ProbDropBlock}}, which drops contiguous blocks from feature maps with a probability given by the normalized feature salience values. We find that, with a simple scheduling strategy, the proposed approach to structured Dropout consistently improves model performance compared to baselines and other Dropout approaches on a diverse range of tasks and models. In particular, we show that \textbf{\emph{ProbDropBlock}} improves RoBERTa finetuning on MNLI by $0.22\%$, and training of ResNet50 on ImageNet by $0.28\%$.
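To make the idea of salience-weighted block dropping concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several details are assumptions that the abstract does not specify: salience is taken to be the absolute activation, min-max normalized per channel; the seed probability is scaled by the block area, in the style of DropBlock; and the "simple scheduling strategy" is illustrated with a linear ramp. The names `prob_drop_block` and `linear_drop_rate` are hypothetical.

```python
import numpy as np


def prob_drop_block(x, drop_rate, block_size, rng):
    """Drop contiguous blocks from a (C, H, W) feature map.

    Block centres are sampled with probability proportional to the
    normalized salience of each position (assumed definition, see below).
    """
    _, h, w = x.shape

    # Assumption: salience = |activation|, min-max normalized per channel.
    sal = np.abs(x)
    lo = sal.min(axis=(1, 2), keepdims=True)
    hi = sal.max(axis=(1, 2), keepdims=True)
    sal = (sal - lo) / (hi - lo + 1e-12)

    # Assumption: divide by the block area so that larger blocks need
    # fewer seeds, as in DropBlock-style seed rates.
    seed_prob = drop_rate * sal / (block_size ** 2)
    seeds = rng.random(x.shape) < seed_prob

    # Grow every seed into a block_size x block_size block by OR-ing
    # shifted copies of the zero-padded seed mask (a max-pool dilation).
    before = block_size // 2
    after = block_size - 1 - before
    padded = np.pad(seeds, ((0, 0), (before, after), (before, after)))
    drop = np.zeros_like(seeds)
    for di in range(block_size):
        for dj in range(block_size):
            drop |= padded[:, di:di + h, dj:dj + w]

    # Zero the dropped positions and rescale the rest to keep the
    # overall activation magnitude roughly unchanged.
    keep = ~drop
    scale = keep.size / max(int(keep.sum()), 1)
    return x * keep * scale


def linear_drop_rate(step, total_steps, target_rate):
    """Hypothetical schedule: ramp the drop rate linearly from 0 to target."""
    return target_rate * min(step / max(total_steps, 1), 1.0)


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 16, 16))
rate = linear_drop_rate(step=500, total_steps=1000, target_rate=0.2)
regularized = prob_drop_block(features, rate, block_size=3, rng=rng)
```

Under these assumptions, positions with larger activations are more likely to seed a dropped block, which is the difference from uniform block dropping; at inference time the function would simply not be applied.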