CL Sep 4, 2023

Hateful Messages: A Conversational Data Set of Hate Speech produced by Adolescents on Discord

Jan Fillies, Silvio Peikert, Adrian Paschke

arXiv:2309.01413v11.37 citations

Originality Synthesis-oriented

AI Analysis

This paper addresses the limited generalizability of automated hate speech detection for adolescents, an active group of social media users, by providing a domain-specific dataset. The contribution is incremental: it focuses on data collection rather than new methods.

The researchers tackled the bias of youth language in hate speech classification by creating a modern, anonymized dataset of 88,395 annotated chat messages from Discord, of which about 6.42% are classified as hate speech. For the 35,553 messages with age annotations, the average author age is under 20 years.

With the rise of social media, a rise in hateful content can be observed. Even though the understanding and definitions of hate speech vary, platforms, communities, and legislatures all acknowledge the problem. Adolescents are a new and active group of social media users, and the majority of them experience or witness online hate speech. Research in the field of automated hate speech classification has been on the rise and focuses on aspects such as bias, generalizability, and performance. To increase generalizability and performance, it is important to understand biases within the data. This research addresses the bias of youth language within hate speech classification and contributes a modern, anonymized hate speech youth language data set consisting of 88,395 annotated chat messages. The data set consists of publicly available online messages from the chat platform Discord. About 6.42% of the messages were classified as hate speech using a self-developed annotation schema. For 35,553 messages, the user profiles provided age annotations, setting the average author age to under 20 years.
