CL · Jan 7

Self-Explaining Hate Speech Detection with Moral Rationales

Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida, Berk Atil, Daryna Dementieva, Andrew Smart, Ameeta Agrawal

arXiv:2601.03481v11.62 citations · h-index: 21

Originality: Incremental advance

AI Analysis

The paper addresses the need for more interpretable and robust hate speech detection, particularly in cultural contexts such as Brazilian Portuguese. The advance is incremental because it builds on existing rationale-supervised methods.

The paper tackles the reliance of hate speech detection models on surface-level features by proposing SMRA, a self-explaining framework that uses moral rationales as supervision. It reports improved performance (+0.9 F1 on binary hate speech detection and +1.5 F1 on multi-label moral sentiment classification) and more faithful explanations (e.g., +7.4 pp IoU F1).

Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
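
The idea of adding rationale supervision to the training objective can be illustrated with a small sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the alignment term is taken to be a cross-entropy between the normalized rationale mask and the attention distribution, and the names `alignment_loss`, `joint_loss`, and the weight `lam` are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution over tokens."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(attention, rationale_mask, eps=1e-12):
    """Cross-entropy between the normalized rationale mask and the attention.

    Assumed form of the alignment term; the paper's actual loss may differ.
    """
    n_marked = sum(rationale_mask)
    if n_marked == 0:
        return 0.0  # no annotated rationale, so no attention supervision
    target = [r / n_marked for r in rationale_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))


def classification_loss(prob_hate, label, eps=1e-12):
    """Binary cross-entropy for a single example."""
    p = prob_hate if label == 1 else 1.0 - prob_hate
    return -math.log(p + eps)


def joint_loss(prob_hate, label, attention, rationale_mask, lam=1.0):
    """Task loss plus a weighted alignment term (lam is a hypothetical weight)."""
    return classification_loss(prob_hate, label) + lam * alignment_loss(
        attention, rationale_mask
    )


# Toy example: four tokens, annotators marked tokens 1 and 2 as the rationale.
mask = [0, 1, 1, 0]
aligned = joint_loss(0.9, 1, softmax([0.1, 3.0, 2.5, 0.2]), mask)
spurious = joint_loss(0.9, 1, softmax([3.0, 0.1, 0.2, 2.5]), mask)
print(aligned < spurious)
```

With the same classifier confidence, the example whose attention falls on the annotated tokens gets the lower joint loss, which is the pressure that would steer a model toward the annotated spans instead of spurious lexical cues.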
