An Annotated Corpus of Arabic Tweets for Hate Speech Analysis
This work provides a dataset for researchers and practitioners working on hate speech detection in Arabic and accounts for dialectal variation, but its contribution is incremental because it applies existing methods to new data.
The study addresses hate speech identification in Arabic tweets by building a multilabel dataset of 10,000 annotated tweets. Inter-annotator agreement was 0.86 for offensive content and 0.71 for hate speech targets, and in evaluation AraBERTv2 achieved a micro-F1 score of 0.7865 and an accuracy of 0.786.
Identifying hate speech in Arabic is challenging because of the language's rich dialectal variation. This study introduces a multilabel hate speech dataset for Arabic. We collected 10,000 Arabic tweets and annotated each tweet for whether it contains offensive content. If a tweet contains offensive content, we further classify it into hate speech targets such as religion, gender, politics, ethnicity, origin, and others. A tweet can have a single target or multiple targets. Multiple annotators took part in the annotation task. The inter-annotator agreement was 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the annotated data with several transformer-based models, among which AraBERTv2 performed best, with a micro-F1 score of 0.7865 and an accuracy of 0.786.
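The two kinds of figures reported above, an agreement coefficient and a micro-averaged F1 over multiple targets, can be illustrated with a minimal sketch. It is not the authors' evaluation code. The abstract does not name the agreement coefficient, so Cohen's kappa for two annotators is assumed here purely for illustration. The column order in `TARGETS`, the helper names `encode`, `micro_f1`, and `cohen_kappa`, and all toy data are hypothetical.

```python
from collections import Counter

# Target categories named in the abstract; the column order is an assumption.
TARGETS = ["religion", "gender", "politics", "ethnicity", "origin", "others"]


def encode(targets):
    """Turn a set of target names into a 0/1 vector over TARGETS."""
    return [1 if name in targets else 0 for name in TARGETS]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over every tweet and every label."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators (assumed metric, not confirmed by the source)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy gold and predicted target sets for three offensive tweets (made-up data).
gold = [encode({"religion"}), encode({"gender", "politics"}), encode(set())]
pred = [encode({"religion"}), encode({"gender"}), encode({"origin"})]

# Toy offensive (1) / not offensive (0) labels from two annotators (made-up data).
annotator_a = [1, 1, 0, 0, 1, 0]
annotator_b = [1, 1, 0, 1, 1, 0]

if __name__ == "__main__":
    print(f"micro-F1: {micro_f1(gold, pred):.4f}")
    print(f"kappa:    {cohen_kappa(annotator_a, annotator_b):.4f}")
```

Micro-averaging pools counts across all targets before computing F1, so frequent targets weigh more than rare ones. With more than two annotators, a coefficient such as Fleiss' kappa or Krippendorff's alpha would be the usual choice instead.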