CL · Mar 16, 2020

Developing a Multilingual Annotated Corpus of Misogyny and Aggression

Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar, Akanksha Bansal, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Atul Kr. Ojha

arXiv:2003.07428v1 · 31.41007 citations

AI Analysis

This work addresses the need for datasets to study and automatically detect misogyny and aggression on social media, particularly for Indian languages. Its contribution is incremental, since it focuses on corpus creation and baseline experiments rather than new detection methods.

The paper tackles the problem of identifying misogyny and aggression in social media by developing a multilingual annotated corpus of over 20,000 comments in Indian English, Hindi, and Indian Bangla, and reports baseline classifier results for misogyny detection in these languages.

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels: aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and the issues and challenges faced during annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
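
The two-level tagset described in the abstract could be represented roughly as follows. This is a minimal sketch, not the authors' data format: the class names, field names, language strings, and the `label_counts` helper are all assumptions made for illustration; only the label names themselves come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Aggression(Enum):
    # Level 1 labels, as named in the abstract.
    OVERTLY_AGGRESSIVE = "overtly aggressive"
    COVERTLY_AGGRESSIVE = "covertly aggressive"
    NON_AGGRESSIVE = "non-aggressive"


class Misogyny(Enum):
    # Level 2 labels, as named in the abstract.
    GENDERED = "gendered"
    NON_GENDERED = "non-gendered"


@dataclass(frozen=True)
class AnnotatedComment:
    # Hypothetical record layout; the released corpus may be structured differently.
    text: str
    language: str
    aggression: Aggression
    misogyny: Misogyny


def label_counts(comments):
    """Count comments for each (aggression, misogyny) label pair."""
    return Counter((c.aggression, c.misogyny) for c in comments)


# Placeholder comments standing in for real corpus entries.
sample = [
    AnnotatedComment("placeholder comment 1", "Hindi",
                     Aggression.NON_AGGRESSIVE, Misogyny.NON_GENDERED),
    AnnotatedComment("placeholder comment 2", "Indian Bangla",
                     Aggression.COVERTLY_AGGRESSIVE, Misogyny.GENDERED),
    AnnotatedComment("placeholder comment 3", "Indian English",
                     Aggression.NON_AGGRESSIVE, Misogyny.NON_GENDERED),
]
counts = label_counts(sample)
```

Keeping the two levels as separate fields reflects that each comment carries one aggression label and one misogyny label independently, so the joint distribution can be inspected before training a classifier.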
