LG · CY · ML · Jun 24, 2024

Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection

arXiv:2406.16846v1 · 5 citations
Originality: Highly original
AI Analysis

This paper addresses subgroup robustness for machine learning practitioners, offering a method that is more efficient than existing techniques such as dataset balancing and that does not require training group annotations.

The paper tackles the problem of machine learning models failing on underrepresented subgroups. It introduces Data Debiasing with Datamodels (D3M), which isolates and removes the specific training examples that cause failures on minority groups. This enables efficient training of debiased classifiers while removing only a small number of examples, with no need for training group annotations or additional hyperparameter tuning.

Machine learning models can fail on subgroups that are underrepresented during training. While techniques such as dataset balancing can improve performance on underperforming groups, they require access to training group annotations and can end up removing large portions of the dataset. In this paper, we introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups. Our approach enables us to efficiently train debiased classifiers while removing only a small number of examples, and does not require training group annotations or additional hyperparameter tuning.
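The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not say how examples are scored, so the scoring rule here (mean attributed effect on minority-group validation examples) is an assumption, as are the function name `harmful_examples`, the `attributions` matrix layout, and the use of a fixed budget `k`. The sketch also assumes that minority-group membership is known for a small validation set; the abstract only states that training group annotations are not needed.

```python
from typing import List, Sequence


def harmful_examples(
    attributions: Sequence[Sequence[float]],
    minority_val_idx: Sequence[int],
    k: int,
) -> List[int]:
    """Return up to k training indices whose mean attributed effect on the
    minority-group validation examples is most negative.

    attributions[i][j] is an assumed, precomputed estimate of how much
    training example i helps (positive) or hurts (negative) the model on
    validation example j.
    """
    scores = []
    for i, row in enumerate(attributions):
        mean_effect = sum(row[j] for j in minority_val_idx) / len(minority_val_idx)
        scores.append((mean_effect, i))
    # Most harmful first: lowest (most negative) mean effect.
    scores.sort()
    # Only examples with a negative estimated effect are candidates for removal.
    return [i for effect, i in scores[:k] if effect < 0]


# Toy data: 4 training examples, 3 validation examples.
# Validation examples 1 and 2 are assumed to belong to the minority group.
attributions = [
    [0.5, 0.2, 0.1],
    [0.4, -0.6, -0.2],
    [0.1, 0.0, 0.3],
    [0.3, -0.1, -0.3],
]
removed = harmful_examples(attributions, minority_val_idx=[1, 2], k=2)
kept = [i for i in range(len(attributions)) if i not in removed]
```

In this toy setup the classifier would then be retrained on the `kept` indices only. Training examples 1 and 3 help on the majority validation example but hurt on the minority ones, which is the pattern the sketch is meant to flag.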
