CL, LG | Jan 17, 2024

Machines Do See Color: A Guideline to Classify Different Forms of Racist Discourse in Large Corpora

Diana Davila Gordillo, Joan C. Timoneda, Sebastian Vallejo Vera

arXiv:2401.09333v3 | 2 citations | h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses the detection of both subtle and overt racist language in large datasets, a problem relevant to researchers and practitioners in social science and NLP. The advance is incremental, since the approach builds on existing cross-lingual models.

The authors identify and classify racist discourse in large text corpora by developing a generalizable guideline and applying XLM-RoBERTa, and they report that their models outperform other state-of-the-art approaches on this task.

Current methods to identify and classify racist language in text rely on small-n qualitative approaches or large-n approaches focusing exclusively on overt forms of racist discourse. This article provides a step-by-step generalizable guideline to identify and classify different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with a cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian indígena community between 2018 and 2021.
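Comparing classifiers on several forms of racist discourse requires a metric that does not let the majority class dominate. The abstract does not say which metric the authors use, so the sketch below assumes macro-averaged F1 as one common choice. The label names (`none`, `overt`, `covert`) and the toy annotations are hypothetical and are not taken from the paper.

```python
# Hypothetical label set; the abstract does not list the paper's categories.
LABELS = ["none", "overt", "covert"]


def per_class_f1(gold, pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(per_class_f1(gold, pred, lab) for lab in labels) / len(labels)


def accuracy(gold, pred):
    """Share of examples where the prediction matches the annotation."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


# Toy, made-up annotations and predictions (not data from the paper).
gold = ["none", "none", "none", "none", "overt", "overt", "covert", "covert"]
# Model A never predicts the subtle class.
model_a = ["none", "none", "none", "none", "overt", "overt", "none", "none"]
# Model B makes the same number of errors but recovers one subtle example.
model_b = ["none", "none", "none", "overt", "overt", "overt", "covert", "none"]

score_a = macro_f1(gold, model_a)
score_b = macro_f1(gold, model_b)
print(f"accuracy A={accuracy(gold, model_a):.2f}  B={accuracy(gold, model_b):.2f}")
print(f"macro-F1 A={score_a:.3f}  B={score_b:.3f}")
```

In this toy case both models have the same accuracy, but the one that misses every subtle example scores lower on macro-F1, which is the kind of difference that matters when the goal is to capture more than overt racism.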

