Exploring the Plausibility of Hate and Counter Speech Detectors with Explainable AI
This work addresses the need for interpretable AI in content moderation, but its contribution is incremental: it compares existing explainability methods rather than introducing new techniques.
The paper investigates the explainability of transformer models for hate and counter speech detection by comparing four explainability approaches. It finds that perturbation-based methods perform best and that explanations help users better understand model predictions.
In this paper, we investigate the explainability of transformer models and their plausibility for hate speech and counter speech detection. We compare representatives of four different explainability approaches, i.e., gradient-based, perturbation-based, attention-based, and prototype-based approaches, and analyze them quantitatively with an ablation study and qualitatively in a user study. Results show that perturbation-based explainability performs best, followed by gradient-based and attention-based explainability. Prototype-based experiments did not yield useful results. Overall, we observe that explainability strongly supports the users in better understanding the model predictions.
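To make the idea of perturbation-based explanation and an ablation-style check concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy lexicon scorer (`TOY_WEIGHTS`, `toy_score`) in place of a trained transformer, and the function names `occlusion_attributions` and `ablation_drop` are hypothetical. Each token is scored by how much the prediction falls when that token is removed; the ablation step then removes the top-k attributed tokens and reports the resulting drop in the score.

```python
import math
from typing import Callable, List

# Hypothetical toy lexicon standing in for a trained classifier (assumption).
TOY_WEIGHTS = {"awful": 2.0, "stupid": 1.5, "kind": -1.0}


def toy_score(tokens: List[str]) -> float:
    """Pseudo-probability of the 'hate' class: logistic of summed token weights."""
    z = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def occlusion_attributions(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[float]:
    """Attribute each token by the score drop observed when it is left out."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def ablation_drop(
    tokens: List[str], score: Callable[[List[str]], float], k: int
) -> float:
    """Remove the k highest-attributed tokens and return the score drop."""
    attrs = occlusion_attributions(tokens, score)
    ranked = sorted(range(len(tokens)), key=lambda i: attrs[i], reverse=True)
    top = set(ranked[:k])
    kept = [t for i, t in enumerate(tokens) if i not in top]
    return score(tokens) - score(kept)


if __name__ == "__main__":
    example = ["you", "are", "awful", "and", "stupid"]
    for token, attr in zip(example, occlusion_attributions(example, toy_score)):
        print(f"{token:>8s}  {attr:+.3f}")
    print("drop after removing top-2 tokens:", round(ablation_drop(example, toy_score, 2), 3))
```

In this sketch, a larger drop after removing the top-ranked tokens suggests that the explanation identified tokens the model actually relies on. A real evaluation would substitute the transformer's class probability for `toy_score`.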