CL, AI | Jan 14

Improving Implicit Hate Speech Detection via a Community-Driven Multi-Agent Framework

arXiv:2601.09342v1 | h-index: 5
Originality: Highly original
AI Analysis

This work addresses the problem of detecting implicitly hateful speech for online moderation, offering a novel approach that enhances accuracy and fairness across demographic groups.

The paper tackles implicit hate speech detection by proposing a community-driven multi-agent framework that integrates socio-cultural context, achieving state-of-the-art performance and improved fairness on the ToxiGen dataset.

This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on the challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
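Balanced accuracy, the metric the abstract highlights, is conventionally defined as the mean of the true positive rate and the true negative rate. The sketch below illustrates that standard definition and a per-group breakdown. It is not the authors' evaluation code: the function names, the binary label convention (1 = hateful, 0 = benign), and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of true positive rate and true negative rate for binary labels.

    Assumes labels are 1 (hateful) or 0 (benign); this convention is
    illustrative and not taken from the paper.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present to compute TPR and TNR")
    tpr = tp / (tp + fn)  # share of hateful items that were caught
    tnr = tn / (tn + fp)  # share of benign items that were left alone
    return (tpr + tnr) / 2


def balanced_accuracy_by_group(y_true, y_pred, groups):
    """Balanced accuracy computed separately for each (hypothetical) target group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: balanced_accuracy(ts, ps) for g, (ts, ps) in buckets.items()}


# Toy, made-up data: an imbalanced set where plain accuracy looks better
# than the classifier deserves.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
overall = balanced_accuracy(y_true, y_pred)

# Toy per-group example with two made-up group labels.
g_true = [1, 1, 0, 0, 1, 1, 0, 0]
g_pred = [1, 1, 0, 0, 1, 0, 1, 0]
g_ids = ["a", "a", "a", "a", "b", "b", "b", "b"]
per_group = balanced_accuracy_by_group(g_true, g_pred, g_ids)
```

On the toy imbalanced set, plain accuracy is 0.875 while balanced accuracy is 0.75, because the missed hateful example costs half of the true positive rate. Comparing the per-group values is one simple way to see whether a classifier performs evenly across target groups.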

