CL · Jun 5, 2019

The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

arXiv:1906.02045v2 · 1.351 citations

Originality: Synthesis-oriented

AI Analysis

The FRENK datasets give researchers a resource for studying and combating socially unacceptable discourse (SUD) in Slovene and English. The contribution is incremental: it builds on existing datasets and improves their comparability.

The authors introduce the FRENK datasets: Facebook comment threads in Slovene and English on two topics, migrants and LGBT, manually annotated for socially unacceptable discourse (SUD). The datasets address the lack of comparable cross-language data by applying identical sampling procedures in both languages and a detailed annotation schema covering six SUD types and five targets. The authors also analyze the annotation distributions and inter-annotator agreement.

In this paper we present datasets of Facebook comment threads on mainstream media posts in Slovene and English, developed within the Slovene national project FRENK. The datasets cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD). Their main advantages over existing datasets are identical sampling procedures, which produce comparable data across languages, and an annotation schema that takes into account six types of SUD and five targets at which SUD is directed. We describe the sampling and annotation procedures, and analyze the annotation distributions and inter-annotator agreements. We consider these datasets to be an important milestone in understanding and combating SUD for both languages.
