CL SI Dec 9, 2019

Analysis of the Ethiopic Twitter Dataset for Abusive Speech in Amharic

Seid Muhie Yimam, Abinew Ali Ayele, Chris Biemann

arXiv:1912.04419v10.618 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses abusive speech detection for Amharic speakers on Twitter, but it is incremental as it focuses on dataset analysis without introducing new methods.

The authors approach the problem of recognizing abusive speech in Amharic by analyzing the first Ethiopic Twitter dataset: they examine the distribution and tendency of abusive content over time and compare the Twitter data with a general Amharic reference corpus.

In this paper, we present an analysis of the first Ethiopic Twitter Dataset for the Amharic language, targeted at recognizing abusive speech. The dataset consists of tweets written in Fidel script that have been collected since 2014. Since several languages can be written using the Fidel script, we used existing Amharic, Tigrinya, and Ge'ez corpora to retain only the Amharic tweets. We analyzed the tweets for abusive speech content with two goals: to analyze the distribution and tendency of abusive speech content over time, and to compare the abusive speech content between the Twitter corpus and a general reference Amharic corpus.
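The abstract does not say how the reference corpora were used to keep only Amharic tweets. One plausible reading is a vocabulary-overlap filter, sketched below under that assumption. The function names, the 0.5 script threshold, and the tiny word lists are all hypothetical placeholders for illustration; they are not taken from the paper.

```python
# Minimal sketch, NOT the authors' method: keep tweets that are written in
# Fidel script and whose tokens overlap most with an Amharic vocabulary.

def is_fidel(text, threshold=0.5):
    """Return True if at least `threshold` of the letters fall in the
    Ethiopic Unicode block (U+1200 to U+137F). Threshold is an assumption."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ethiopic = sum(1 for ch in letters if "\u1200" <= ch <= "\u137f")
    return ethiopic / len(letters) >= threshold


def guess_language(tweet, vocabularies):
    """Pick the language whose vocabulary contains the most tweet tokens.
    Returns None when nothing matches or the top two languages tie."""
    tokens = tweet.split()
    scores = {
        lang: sum(tok in vocab for tok in tokens)
        for lang, vocab in vocabularies.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def keep_amharic(tweets, vocabularies):
    """Retain only Fidel-script tweets classified as Amharic."""
    return [
        t for t in tweets
        if is_fidel(t) and guess_language(t, vocabularies) == "amharic"
    ]


# Toy vocabularies standing in for word lists extracted from real corpora.
TOY_VOCAB = {
    "amharic": {"ሰላም", "እንዴት", "ነህ"},
    "tigrinya": {"ሰላም", "ከመይ", "ኣለኻ"},
}
TOY_TWEETS = ["ሰላም እንዴት ነህ", "ሰላም ከመይ ኣለኻ", "hello world"]

if __name__ == "__main__":
    print(keep_amharic(TOY_TWEETS, TOY_VOCAB))
```

In practice the vocabularies would be built from the full Amharic, Tigrinya, and Ge'ez corpora, and words shared across languages (such as the greeting in the toy lists) would need weighting or a tie-breaking rule; this sketch simply discards ties.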
