Model Debiasing by Learnable Data Augmentation
This work addresses model debiasing in the unsupervised setting, where the bias is unknown, which is crucial for improving generalization in real-world applications. The contribution appears incremental, however, since it builds on existing data augmentation and regularization techniques.
The paper tackles the problem of learning from biased data without bias annotations by proposing a two-stage pipeline that identifies biased/unbiased samples and uses learnable data augmentation to regularize training. The method achieves state-of-the-art classification accuracy on synthetic and realistic biased datasets, improving generalization even for unbiased data.
Deep Neural Networks are well known for efficiently fitting training data, yet they generalize poorly whenever some kind of bias dominates over the actual task labels, resulting in models learning "shortcuts". In essence, such models are prone to learning spurious correlations between data and labels. In this work, we tackle the problem of learning from biased data in the very realistic unsupervised scenario, i.e., when the bias is unknown. This is a much harder task than the supervised case, where auxiliary, bias-related annotations can be exploited in the learning process. This paper proposes a novel 2-stage learning pipeline featuring a data augmentation strategy able to regularize the training. First, biased/unbiased samples are identified by training over-biased models. Second, this (typically noisy) subdivision is exploited within a data augmentation framework that combines the original samples while learning the mixing parameters, which has a regularization effect. Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods and demonstrating robust performance on both biased and unbiased examples. Notably, since our training method is totally agnostic to the level of bias, it also improves performance on any dataset, even an apparently unbiased one, thus improving model generalization regardless of the level of bias (or its absence) in the data.
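The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the over-biased model's output is turned into a biased/unbiased split, nor the exact form of the mixing, so the loss-threshold rule, the sigmoid parameterization of the mixing coefficient, and all function names (`split_by_loss`, `mix`, `theta_grad`) are assumptions made here for illustration only.

```python
import math


def split_by_loss(losses, threshold):
    """Stage 1 (assumed rule): samples that an over-biased model fits well
    (low loss) are treated as biased, the remaining ones as unbiased.
    The split is expected to be noisy."""
    biased = [i for i, loss in enumerate(losses) if loss <= threshold]
    unbiased = [i for i, loss in enumerate(losses) if loss > threshold]
    return biased, unbiased


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mix(x_biased, x_unbiased, theta):
    """Stage 2 (assumed form): convex combination of a biased and an unbiased
    sample. The coefficient lam = sigmoid(theta) stays in (0, 1), and theta
    is the learnable mixing parameter."""
    lam = sigmoid(theta)
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_biased, x_unbiased)]


def theta_grad(x_biased, x_unbiased, theta, grad_mixed):
    """Chain rule for the mixing parameter: given dL/d(mixed) from any
    training loss, return dL/d(theta)."""
    lam = sigmoid(theta)
    dlam_dtheta = lam * (1.0 - lam)
    dloss_dlam = sum(g * (a - b)
                     for g, a, b in zip(grad_mixed, x_biased, x_unbiased))
    return dloss_dlam * dlam_dtheta


# Hypothetical per-sample losses produced by an over-biased model.
biased_idx, unbiased_idx = split_by_loss([0.1, 2.0, 0.05, 1.5], threshold=1.0)
mixed = mix([1.0, 0.0], [0.0, 1.0], theta=0.0)
```

Parameterizing the coefficient through a sigmoid keeps the mixed sample between the two originals while leaving `theta` unconstrained, so it can be updated by ordinary gradient descent together with the network weights. In a real training loop an automatic differentiation framework would compute this gradient.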