CL · Nov 11, 2022

CoRAL: a Context-aware Croatian Abusive Language Dataset

Ravi Shekhar, Mladen Karan, Matthew Purver

arXiv:2211.06053v1

Originality: Synthesis-oriented

AI Analysis

This work addresses semi-automated comment moderation for Croatian social media. The contribution is incremental in that it focuses on a specific language and cultural context.

The authors address the detection of abusive language in Croatian social media comments, particularly when the abuse is implicit and context-dependent. They create the CoRAL dataset and show that current models degrade when comments are not explicit, and degrade further when language skill and context knowledge are required to interpret them.

In light of unprecedented increases in the popularity of the internet and social media, comment moderation has never been a more relevant task. Semi-automated comment moderation systems greatly aid human moderators by either automatically classifying the examples or allowing the moderators to prioritize which comments to consider first. However, the concept of inappropriate content is often subjective, and such content can be conveyed in many subtle and indirect ways. In this work, we propose CoRAL -- a language and culturally aware Croatian Abusive dataset covering phenomena of implicitness and reliance on local and global context. We show experimentally that current models degrade when comments are not explicit and further degrade when language skill and context knowledge are required to interpret the comment.
