The Effect of Model Size on Worst-Group Generalization
This work addresses the problem of poor generalization on rare subgroups, which is relevant to machine learning practitioners. It offers practical advice: use larger pre-trained models. The contribution is incremental, as it builds on existing studies of overparameterization.
The paper investigates how model size affects worst-group generalization when subgroup information is unknown. It finds that increasing model size does not harm, and may improve, worst-group test performance across architectures, domains, and initializations, and that larger pre-trained models consistently improve performance on Waterbirds and MultiNLI.
Overparameterization has been shown to result in poor test accuracy on rare subgroups under a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architecture (ResNet, VGG, or BERT), 2) domain (vision or natural language processing), 3) model size (width or depth), and 4) initialization (pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when subgroup labels are unknown.
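To make the evaluation metric concrete, the following is a minimal sketch of how worst-group accuracy could be computed next to average accuracy. It assumes that group annotations are available for the evaluation set even though they are not used during ERM training. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Return a dict mapping each group label to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (group, accuracy) for the group with the lowest accuracy."""
    accs = group_accuracies(y_true, y_pred, groups)
    worst = min(accs, key=accs.get)
    return worst, accs[worst]


# Toy evaluation set (hypothetical): 8 majority-group examples, 2 rare-group examples.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
groups = ["majority"] * 8 + ["rare"] * 2

average_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
worst_group, worst_accuracy = worst_group_accuracy(y_true, y_pred, groups)
```

In this toy case the average accuracy is high while the rare group is only half correct, which is the kind of gap that worst-group evaluation is meant to expose.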