CL · May 26

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser

arXiv:2605.27025 · 85.5

Predicted impact: top 60% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners in hate speech detection, this work provides a structured method to improve LLM alignment with human judgments, though the approach is incremental.

The paper analyzes how well LLMs align with human hate speech annotations across ten subjective attributes. It finds that explicit behavioral dimensions correlate well with human annotations, while evaluative dimensions are systematically inverted. A confidence-weighted Ridge regression that combines attribute-level predictions achieves R² of up to 0.71, outperforming direct prompting.

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus. This approach achieves $R^2$ of up to 0.71 and outperforms direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.
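
The abstract does not spell out how the confidence weighting enters the regression, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each attribute prediction is multiplied by the model's confidence before a closed-form ridge fit with an unpenalized intercept. The two attributes, the toy numbers, and every function name are hypothetical and chosen purely for illustration; the resulting fit says nothing about the reported 0.71.

```python
# Illustrative sketch only: the weighting scheme, data, and names are assumptions.

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def weight_by_confidence(predictions, confidences):
    """Assumed scheme: scale each attribute prediction by its confidence."""
    return [[p * c for p, c in zip(pr, cr)]
            for pr, cr in zip(predictions, confidences)]

def fit_ridge(features, targets, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y; the last weight is the intercept."""
    rows = [f + [1.0] for f in features]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d - 1):  # leave the intercept unpenalized
        gram[i][i] += alpha
    rhs = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(gram, rhs)

def predict(weights, features):
    return [sum(w * x for w, x in zip(weights, f + [1.0])) for f in features]

def r_squared(targets, predicted):
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot

# Toy data (invented): columns are hypothetical "insult" and "respect" ratings.
attribute_predictions = [[4, 1], [3, 0], [1, 3], [0, 4], [2, 2], [4, 0]]
attribute_confidences = [[0.9, 0.8], [0.7, 0.9], [0.8, 0.6],
                         [0.9, 0.9], [0.5, 0.5], [0.95, 0.85]]
human_scores = [2.0, 1.5, -1.0, -2.5, 0.0, 2.4]

features = weight_by_confidence(attribute_predictions, attribute_confidences)
weights = fit_ridge(features, human_scores, alpha=1.0)
r2 = r_squared(human_scores, predict(weights, features))
print("weights (insult, respect, intercept):", weights)
print("in-sample R^2 on toy data:", r2)
```

One point this sketch makes concrete: a linear combiner can assign a negative weight to an attribute whose raw signal runs opposite to the target, which is one plausible reason a regression over attributes could still use dimensions that look inverted in isolation. In practice a library implementation with held-out evaluation would be preferable to this hand-rolled solver.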
