CL · AI · Sep 6, 2023

On the Challenges of Building Datasets for Hate Speech Detection

arXiv:2309.02912v1 · 1 citation · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of large, carefully curated datasets for hate speech detection, a gap that stems from the task's highly subjective nature and that limits what NLP practitioners can build.

The paper analyzes the obstacles to building hate speech detection datasets through a data-centric lens, then proposes a holistic framework that organizes the data creation pipeline along seven broad dimensions, using hate speech towards sexual minorities as the specific example.

Detection of hate speech has been formulated as a standalone application of NLP and different approaches have been adopted for identifying the target groups, obtaining raw data, defining the labeling process, choosing the detection algorithm, and evaluating the performance in the desired setting. However, unlike other downstream tasks, hate speech suffers from the lack of large-sized, carefully curated, generalizable datasets owing to the highly subjective nature of the task. In this paper, we first analyze the issues surrounding hate speech detection through a data-centric lens. We then outline a holistic framework to encapsulate the data creation pipeline across seven broad dimensions by taking the specific example of hate speech towards sexual minorities. We posit that practitioners would benefit from following this framework as a form of best practice when creating hate speech datasets in the future.

