ID-XCB: Data-independent Debiasing for Fair and Accurate Transformer-based Cyberbullying Detection
This work addresses fairness and accuracy issues in cyberbullying detection for social media platforms, though the contribution is incremental because it builds on existing debiasing approaches.
The paper tackles biases in cyberbullying detection models caused by spurious associations with swear words in datasets. It introduces ID-XCB, a data-independent debiasing technique that outperforms state-of-the-art methods in both detection performance and bias mitigation.
Swear words are a common proxy to collect datasets with cyberbullying incidents. Our focus is on measuring and mitigating biases derived from spurious associations between swear words and incidents, which arise as a result of such data collection strategies. After demonstrating and quantifying these biases, we introduce ID-XCB, the first data-independent debiasing technique that combines adversarial training, bias constraints and a debias fine-tuning approach, aimed at alleviating model attention to bias-inducing words without impacting overall model performance. We explore ID-XCB on two popular session-based cyberbullying datasets along with comprehensive ablation and generalisation studies. We show that ID-XCB learns robust cyberbullying detection capabilities while mitigating biases, outperforming state-of-the-art debiasing methods in both performance and bias mitigation. Our quantitative and qualitative analyses demonstrate its generalisability to unseen data.
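To make the idea of quantifying swear-word bias concrete, the following is a minimal sketch of one possible measurement: the gap in false positive rate between sessions that contain a swear word and sessions that do not. This is an illustrative assumption, not the metric or implementation used in the paper; the function names (`contains_swear`, `false_positive_rate`, `swear_fpr_gap`), the tiny lexicon, and the toy data are all hypothetical.

```python
from typing import List, Set


def contains_swear(text: str, lexicon: Set[str]) -> bool:
    """Return True if any whitespace-separated token of `text` is in `lexicon`."""
    tokens = (tok.strip(".,!?;:\"'").lower() for tok in text.split())
    return any(tok in lexicon for tok in tokens)


def false_positive_rate(labels: List[int], preds: List[int]) -> float:
    """FPR = FP / (FP + TN); returns 0.0 when there are no negative examples."""
    negative_preds = [p for y, p in zip(labels, preds) if y == 0]
    if not negative_preds:
        return 0.0
    return sum(negative_preds) / len(negative_preds)


def swear_fpr_gap(
    texts: List[str], labels: List[int], preds: List[int], lexicon: Set[str]
) -> float:
    """FPR on sessions containing swear words minus FPR on the remaining sessions.

    A large positive gap suggests the model flags harmless sessions
    mainly because they contain swear words (hypothetical bias proxy).
    """
    has_swear = [contains_swear(t, lexicon) for t in texts]
    with_labels = [y for y, s in zip(labels, has_swear) if s]
    with_preds = [p for p, s in zip(preds, has_swear) if s]
    without_labels = [y for y, s in zip(labels, has_swear) if not s]
    without_preds = [p for p, s in zip(preds, has_swear) if not s]
    return false_positive_rate(with_labels, with_preds) - false_positive_rate(
        without_labels, without_preds
    )


# Toy data (hypothetical): 1 = cyberbullying, 0 = not cyberbullying.
lexicon = {"damn", "hell"}
texts = [
    "you are a damn idiot",
    "damn that game was great",
    "hell yeah nice goal",
    "see you tomorrow",
    "nice work today",
    "nobody likes you loser",
]
labels = [1, 0, 0, 0, 0, 1]
preds = [1, 1, 0, 0, 0, 0]

gap = swear_fpr_gap(texts, labels, preds, lexicon)
print(f"FPR gap (swear vs. no swear): {gap:.2f}")
```

In this toy example, one of the two harmless sessions containing a swear word is wrongly flagged, while neither harmless session without swear words is, so the gap is positive. A debiasing method would aim to shrink such a gap without lowering overall detection performance.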