CL · May 23, 2023

BiasX: "Thinking Slow" in Toxic Content Moderation with Explanations of Implied Social Biases

Yiming Zhang, Sravani Nanduri, Liwei Jiang, Tongshuang Wu, Maarten Sap

arXiv:2305.13589v121.5135 citations · h-index: 40

Originality: Incremental advance

AI Analysis

This work addresses content moderation challenges faced by platforms and annotators, offering an incremental improvement through explanation-based interventions.

The paper tackles two failure modes in content moderation: subtle toxicity being missed, and harmless but seemingly toxic content being over-detected. It introduces BiasX, a framework that provides free-text explanations of a statement's implied social biases. In a crowdsourced user study, expert-written explanations improved participants' accuracy by 7.2% on hard toxic examples, compared with 2.4% for machine-generated explanations.

Toxicity annotators and content moderators often default to mental shortcuts when making decisions. This can lead to subtle toxicity being missed, and seemingly toxic but harmless content being over-detected. We introduce BiasX, a framework that enhances content moderation setups with free-text explanations of statements' implied social biases, and explore its effectiveness through a large-scale crowdsourced user study. We show that indeed, participants substantially benefit from explanations for correctly identifying subtly (non-)toxic content. The quality of explanations is critical: imperfect machine-generated explanations (+2.4% on hard toxic examples) help less compared to expert-written human explanations (+7.2%). Our results showcase the promise of using free-text explanations to encourage more thoughtful toxicity moderation.

