CL | AI | May 30, 2025

Drop Dropout on Single-Epoch Language Model Pretraining

Stanford
arXiv:2505.24788v1 | 1 citation | h-index: 4 | Has Code | ACL
Originality: Incremental advance
AI Analysis

This paper addresses the optimization of large language model pretraining by showing that dropout is unnecessary in single-epoch settings, an incremental advance that is nonetheless relevant to efficiency and performance in NLP.

The study investigated the impact of dropout in single-epoch pretraining of language models, finding that removing dropout improves downstream performance across tasks like language modeling, BLiMP, SQuAD, and MNLI, and enhances gradient-based model editing.

Originally, dropout was seen as a breakthrough regularization technique that improved performance in almost all applications of deep learning by reducing overfitting. Yet the single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, so dropout is generally not used for large LMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently introduced "early dropout" also degrades performance relative to applying no dropout at all. We further investigate the models' editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate dropping dropout during single-epoch pretraining.
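
The dropout settings compared in the abstract (no dropout, constant dropout, and "early dropout") can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `early_steps` cutoff, and the example rate of 0.1 are hypothetical, and "early dropout" is assumed here to mean applying dropout only for an initial number of training steps and then switching it off.

```python
import random


def dropout(values, p, training=True, rng=random):
    """Inverted dropout: zero each value with probability p and scale
    the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # No dropout: the setting the abstract advocates for pretraining.
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dropout_rate(step, base_p, early_steps=None):
    """Dropout probability to use at a given training step.

    early_steps=None -> constant dropout at base_p for all steps.
    early_steps=N    -> assumed 'early dropout': base_p for the first
                        N steps, then 0.0 for the rest of training.
    """
    if early_steps is None:
        return base_p
    return base_p if step < early_steps else 0.0


if __name__ == "__main__":
    activations = [1.0, 2.0, 3.0, 4.0]
    for step in (0, 500, 2000):
        # Hypothetical values: rate 0.1, dropout switched off after 1000 steps.
        p = dropout_rate(step, base_p=0.1, early_steps=1000)
        print(step, p, dropout(activations, p))
```

Setting `base_p=0.0` reproduces the no-dropout configuration, in which `dropout` returns its input unchanged.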

Code Implementations: 1 repo
