Directions in Abusive Language Training Data: Garbage In, Garbage Out
The paper addresses the problem of inconsistent and low-quality training data for abusive language detection, which affects researchers and practitioners in NLP and online safety. Its contribution is incremental, however, because it synthesizes existing knowledge.
The paper systematically reviews how abusive language datasets are created and what they contain, and from this derives evidence-based recommendations for practitioners working with this complex data.
Data-driven analysis and detection of abusive online content covers many different tasks, phenomena, contexts, and methodologies. This paper systematically reviews abusive language dataset creation and content in conjunction with an open website for cataloguing abusive language data. This collection of knowledge leads to a synthesis providing evidence-based recommendations for practitioners working with this complex and highly diverse data.