Training on Plausible Counterfactuals Removes Spurious Correlations
This work addresses bias reduction in machine learning models to improve fairness. Its contribution is incremental: it extends adversarial perturbation methods to plausible counterfactuals.
The study tackled the problem of spurious correlations in classifiers by training them on plausible counterfactual explanations labeled with incorrect classes, which resulted in high in-distribution accuracy and significantly reduced bias.
Plausible counterfactual explanations (p-CFEs) are perturbations that minimally modify inputs to change classifier decisions while remaining plausible under the data distribution. In this study, we demonstrate that classifiers can be trained on p-CFEs labeled with induced \emph{incorrect} target classes to classify unperturbed inputs with the original labels. While previous studies have shown that such learning is possible with adversarial perturbations, we extend this paradigm to p-CFEs. Interestingly, our experiments reveal that learning from p-CFEs is even more effective: the resulting classifiers not only achieve high in-distribution accuracy but also exhibit significantly reduced bias with respect to spurious correlations.
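To make the training setup concrete, the following is a minimal sketch of the pipeline described above: perturb each input until a reference classifier changes its decision, label the perturbed input with the induced class, train a fresh classifier on those pairs only, and evaluate it on the unperturbed inputs. It is not the authors' implementation. The toy data, the logistic-regression model, and all names (`fit_logreg`, `counterfactuals`, `overshoot`) are assumptions made for illustration. In particular, the perturbation used here is the closest boundary crossing of a linear classifier, which does not enforce plausibility under the data distribution and is therefore closer to an adversarial perturbation than to a true p-CFE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two Gaussian blobs in 2D (illustrative, not from the paper).
n = 200
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

def counterfactuals(X, w, b, overshoot=0.1):
    """Smallest L2 move across a linear boundary, labeled with the induced class.

    Assumption: this stands in for a p-CFE generator; it has no plausibility term.
    """
    score = X @ w + b
    # Move each point along w just past the decision boundary.
    X_cf = X - ((1.0 + overshoot) * score / (w @ w))[:, None] * w
    # Target label is the class the reference classifier is pushed into.
    targets = 1 - (score > 0).astype(int)
    return X_cf, targets

# 1) Reference classifier trained on the original data.
w_ref, b_ref = fit_logreg(X, y)
# 2) Perturbed inputs, labeled with the induced (flipped) class.
X_cf, y_cf = counterfactuals(X, w_ref, b_ref)
# 3) New classifier trained only on the perturbed, relabeled inputs.
w_new, b_new = fit_logreg(X_cf, y_cf)
# 4) Evaluation on the unperturbed inputs with their original labels.
acc = np.mean(predict(X, w_new, b_new) == y)
print(f"accuracy on unperturbed inputs: {acc:.3f}")
```

Replacing `counterfactuals` with a generator that also constrains the perturbed input to stay in a high-density region of the data would turn this sketch into the p-CFE setting; the remaining three steps would stay the same.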