CL · Apr 27, 2022

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

arXiv:2204.12708v1642 citations · h-index: 31
Originality: Synthesis-oriented
AI Analysis

This paper addresses overfitting and poor generalization in NLP, a concern for researchers and practitioners alike, but its contribution is incremental: it critiques existing methods without presenting new empirical results.

The paper argues that dataset balancing cannot eliminate spurious correlations in NLP models: increasingly powerful models keep exploiting ever-smaller correlations, and a truly balanced dataset may also discard useful signal such as common sense and world knowledge. It suggests alternatives such as richer contexts, model abstention and user interaction, and zero- or few-shot setups.

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.
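To make the notion of a single-word correlation concrete, the following is a minimal sketch of how one might flag words whose presence nearly determines the label. It is an illustration only, not the procedure of Gardner et al. (2021) or of this paper: the function names, the `min_count` and `threshold` parameters, and the toy sentiment examples are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def word_label_stats(examples):
    """Map each word to a Counter of labels, counting a word once per example."""
    stats = defaultdict(Counter)
    for text, label in examples:
        for word in set(text.lower().split()):
            stats[word][label] += 1
    return stats


def skewed_words(examples, min_count=2, threshold=0.9):
    """Return (word, majority_label, p) where p = max_y p(y | word) >= threshold.

    Words seen in fewer than `min_count` examples are skipped, since their
    estimates are too noisy to say anything. Both cutoffs are arbitrary
    illustrative choices.
    """
    flagged = []
    for word, counts in word_label_stats(examples).items():
        total = sum(counts.values())
        if total < min_count:
            continue
        label, top = counts.most_common(1)[0]
        if top / total >= threshold:
            flagged.append((word, label, top / total))
    return sorted(flagged)


# Hypothetical toy dataset: every word is balanced across labels except "not".
toy = [
    ("the movie was not good", "neg"),
    ("not a film worth watching", "neg"),
    ("the movie was good", "pos"),
    ("a film worth watching", "pos"),
]

print(skewed_words(toy))  # [('not', 'neg', 1.0)]
```

In this toy set, balancing would mean adding or removing examples until "not" appears equally often under both labels. That also illustrates the abstract's concern: the link between negation and negative sentiment is arguably real signal rather than an artifact, and a check over single words says nothing about correlations carried by combinations of features.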

