CV | Dec 16, 2022

Better May Not Be Fairer: A Study on Subgroup Discrepancy in Image Classification

CMU
arXiv:2212.08649v2 | 9 citations | h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses fairness and robustness issues in image classification for AI practitioners. It is an incremental advance, building on existing research into data augmentation and spurious correlations.

The study investigates how natural background colors act as spurious features and cause subgroup discrepancies in image classification. It finds that high overall accuracy does not ensure consistent subgroup performance. It proposes FlowAug, a semantic data augmentation method that improves subgroup consistency and generalization, and MacroStd, a metric under which FlowAug shows improved subgroup discrepancy.

In this paper, we provide 20,000 non-trivial human annotations on popular datasets as a first step toward bridging the gap in studying how natural semantic spurious features affect image classification, as prior works often study datasets that mix low-level features because realistic datasets are hard to access. We investigate how natural background colors act as spurious features by annotating the test sets of CIFAR10 and CIFAR100 into subgroups based on the background color of each image. We name our datasets CIFAR10-B and CIFAR100-B and integrate them with the CIFAR-C datasets. We find that overall human-level accuracy does not guarantee consistent subgroup performance, and that the phenomenon persists even for models pre-trained on ImageNet or trained with data augmentation (DA). To alleviate this issue, we propose FlowAug, a semantic DA method that leverages decoupled semantic representations captured by a pre-trained generative flow. Experimental results show that FlowAug achieves more consistent subgroup results than other types of DA methods on CIFAR10/100 and on CIFAR10/100-C. Additionally, it shows better generalization performance. Furthermore, we propose a generic metric, MacroStd, for studying model robustness to spurious correlations, in which we take a macro average of the weighted standard deviations across different classes. We show that MacroStd is more predictive of better performance; per our metric, FlowAug demonstrates improvements on subgroup discrepancy. Although this metric is proposed to study our curated datasets, it applies to all datasets that have subgroups or subclasses. Lastly, we also show superior out-of-distribution results on CIFAR10.1.
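The abstract describes MacroStd only as a macro average of weighted standard deviations across classes, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that, within each class, the quantity being compared is per-subgroup accuracy and that each subgroup is weighted by its share of that class's samples. The function name `macro_std` and its arguments are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def macro_std(correct, classes, subgroups):
    """Macro average over classes of the weighted std of subgroup accuracies.

    Assumptions (not confirmed by the source): the per-subgroup statistic is
    accuracy, and weights are subgroup sizes normalized within each class.
    """
    # tally[class][subgroup] = [number correct, number of samples]
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for ok, cls, grp in zip(correct, classes, subgroups):
        tally[cls][grp][0] += int(bool(ok))
        tally[cls][grp][1] += 1

    per_class_std = []
    for groups in tally.values():
        total = sum(n for _, n in groups.values())
        # (accuracy, weight) for each subgroup present in this class
        stats = [(hits / n, n / total) for hits, n in groups.values()]
        mean = sum(w * acc for acc, w in stats)
        var = sum(w * (acc - mean) ** 2 for acc, w in stats)
        per_class_std.append(sqrt(var))

    # Macro average: every class counts equally, regardless of its size.
    return sum(per_class_std) / len(per_class_std)


# Toy example: "cat" is solved only on green backgrounds, "dog" on both.
correct = [1, 1, 0, 0, 1, 1, 1, 1]
classes = ["cat"] * 4 + ["dog"] * 4
subgroups = ["green", "green", "blue", "blue"] * 2
print(macro_std(correct, classes, subgroups))  # 0.25
```

In this toy case overall accuracy is 0.75, yet the "cat" class has a subgroup standard deviation of 0.5 while "dog" has 0, which averages to 0.25. A lower value indicates more consistent performance across subgroups.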

Code Implementations: 1 repo