CL, AI | Jan 14

Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework

Ewelina Gajewska, Katarzyna Budzynska, Jarosław A Chudziak

arXiv:2601.09342v1 | h-index: 5

Originality: Highly original

AI Analysis

This work addresses the problem of detecting implicitly hateful speech for online moderation, offering a novel approach that enhances accuracy and fairness across demographic groups.

The paper tackles implicit hate speech detection by proposing a community-driven multi-agent framework that integrates socio-cultural context, achieving state-of-the-art performance and improved fairness on the ToxiGen dataset.

This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
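
The abstract names balanced accuracy as its central fairness metric. As a rough illustration only (not the authors' evaluation code), balanced accuracy for a binary task is commonly defined as the mean of the true positive rate and the true negative rate. The sketch below assumes labels of 1 for hateful and 0 for benign; the function name and the toy data are hypothetical and are not drawn from ToxiGen.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true positive rate and true negative rate.

    Assumes binary labels: 1 = hateful, 0 = benign (an assumption made
    for this sketch, not a detail taken from the paper).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")

    tpr = tp / positives  # share of hateful items correctly flagged
    tnr = tn / negatives  # share of benign items correctly passed
    return (tpr + tnr) / 2


# Hypothetical toy data: 8 benign and 2 hateful items, scored by a
# classifier that labels everything as benign.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                      # 0.8
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

In this toy case plain accuracy looks acceptable at 0.8, while balanced accuracy drops to 0.5 because every hateful item is missed. Computing the same quantity separately for each target group is one way to compare how evenly a classifier performs across groups.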
