Counterfactually Augmented Data and Unintended Bias: The Case of Sexism and Hate Speech Detection
This work addresses unintended bias in AI models for content moderation and highlights the risk of over-relying on core features; its implications for improving fairness in detection systems are incremental.
The study investigates whether Counterfactually Augmented Data (CAD) improves model robustness in sexism and hate speech detection. Models trained on CAD, particularly construct-driven CAD, had higher false-positive rates on challenging non-hateful data than models trained on the original data, while training on a diverse set of CAD types reduced this unintended bias.
Counterfactually Augmented Data (CAD) aims to improve out-of-domain generalizability, an indicator of model robustness. The improvement is credited to CAD promoting core features of the construct over spurious artifacts that happen to correlate with it. Yet, over-relying on core features may lead to unintended model bias. In particular, construct-driven CAD -- perturbations of core features -- may induce models to ignore the context in which core features are used. Here, we test models for sexism and hate speech detection on challenging data: non-hateful and non-sexist usage of identity and gendered terms. In these hard cases, models trained on CAD, especially construct-driven CAD, show higher false-positive rates than models trained on the original, unperturbed data. Using a diverse set of CAD -- construct-driven and construct-agnostic -- reduces such unintended bias.
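
The false-positive rate on hard cases can be illustrated with a minimal sketch. The function name `false_positive_rate`, the label encoding (0 = non-hateful or non-sexist, 1 = hateful or sexist), and the toy labels below are assumptions made for illustration; they are not taken from the study's code or results.

```python
from typing import Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return FP / (FP + TN), computed over the negative (label 0) examples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only predictions for examples whose gold label is negative.
    negative_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not negative_preds:
        raise ValueError("no negative examples; FPR is undefined")
    false_positives = sum(1 for p in negative_preds if p == 1)
    return false_positives / len(negative_preds)


# Toy, made-up data for illustration only (not results from the study).
# Every hard case is a non-hateful use of an identity or gendered term.
hard_case_labels = [0, 0, 0, 0]
toy_predictions = [1, 0, 1, 0]  # hypothetical output of some classifier

fpr = false_positive_rate(hard_case_labels, toy_predictions)
print(f"FPR on hard cases: {fpr:.2f}")
```

Because every example in such a hard-case set is negative, the false-positive rate is simply the fraction of examples the model flags, which makes it a direct measure of how often a model reacts to the term itself rather than to its context.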