CL · AI · Aug 23, 2022

K-MHaS: A Multi-label Hate Speech Detection Dataset in Korean Online News Comment

Jean Lee, Taejun Lim, Heejun Lee, Bogeun Jo, Yangsok Kim, Heegeun Yoon, Soyeon Caren Han

arXiv:2208.10684v331.1586 citations · h-index: 21 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the shortage of hate speech detection resources for Korean. The contribution is incremental: it applies existing methods to a new dataset.

The authors tackled the lack of non-English hate speech detection resources by introducing K-MHaS, a multi-label dataset of 109k Korean news comments, and found that KR-BERT with a sub-character tokenizer outperformed other models in evaluation.

Online hate speech detection has become an important issue due to the growth of online content, but resources in languages other than English are extremely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides a multi-label classification using 1 to 4 labels, and handles subjectivity and intersectionality. We evaluate strong baseline experiments on K-MHaS using Korean-BERT-based language models with six different metrics. KR-BERT with a sub-character tokenizer outperforms others, recognizing decomposed characters in each hate speech class.
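To make "decomposed characters" concrete, here is a minimal sketch of how a precomposed Hangul syllable can be split into its constituent jamo using the standard Unicode Hangul arithmetic. This is an illustration only, not KR-BERT's actual tokenizer: the function name `decompose` is hypothetical, and the real sub-character tokenizer may use a different jamo representation and adds subword segmentation on top.

```python
# Illustrative sketch only: split precomposed Hangul syllables into jamo
# with the standard Unicode arithmetic. This is NOT KR-BERT's tokenizer;
# `decompose` is a hypothetical helper name.

S_BASE = 0xAC00   # first precomposed syllable, "가"
L_BASE = 0x1100   # first leading consonant (choseong)
V_BASE = 0x1161   # first vowel (jungseong)
T_BASE = 0x11A7   # one before the first trailing consonant (jongseong)
V_COUNT = 21
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose(text: str) -> str:
    """Return text with every Hangul syllable replaced by its jamo."""
    out = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            lead, rem = divmod(idx, V_COUNT * T_COUNT)
            vowel, trail = divmod(rem, T_COUNT)
            out.append(chr(L_BASE + lead))
            out.append(chr(V_BASE + vowel))
            if trail:  # trail == 0 means the syllable has no final consonant
                out.append(chr(T_BASE + trail))
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return "".join(out)


if __name__ == "__main__":
    for ch in decompose("한글"):
        print(f"U+{ord(ch):04X}")
```

Working at this level lets a model share sub-character units across syllables that differ only in, say, a final consonant, which is one plausible reason a sub-character tokenizer helps on noisy comment text.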
