Cross-lingual Inductive Transfer to Detect Offensive Language
This work addresses the need for cross-lingual offensive language detection on social media, which matters to both users and platforms. The contribution is incremental, however, because it builds on existing models and datasets.
The paper tackles the problem of detecting offensive language in social media across multiple languages by introducing a cross-lingual inductive approach using XLM-RoBERTa, achieving competitive results such as an F1-score of 0.919 in English and 0.781 in Turkish on the mOLID dataset.
As social media has become more widely used and more widely available, offensive language has been observed across many languages and domains. This has created a growing need to detect offensive language on social media cross-lingually. For OffensEval 2020, the organizers released the \textit{multilingual Offensive Language Identification Dataset} (mOLID), which contains tweets in five different languages annotated for offensive language detection. In this work, we introduce a cross-lingual inductive approach to identify offensive language in tweets using the contextual word embedding \textit{XLM-RoBERTa} (XLM-R). We show that our model performs competitively on all five languages, obtaining the fourth position in the English task with an F1-score of $0.919$ and the eighth position in the Turkish task with an F1-score of $0.781$. Further experimentation shows that our model performs competitively in a zero-shot learning environment and can be extended to other languages.
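To make the reported metric concrete, the following minimal sketch computes an F1-score for a binary offensive/not-offensive task. It is illustrative only and is not the authors' evaluation code. The abstract does not say which averaging scheme is used, so the macro average shown here is an assumption, as are the label names `OFF` and `NOT` and the helper names `f1_for_label` and `macro_f1`.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=("OFF", "NOT")):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical gold labels and predictions for four tweets.
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
print(round(score, 3))
```

In this toy example one offensive tweet is missed, so recall for the `OFF` class drops to one half while precision stays at one. Averaging per-class scores weights both classes equally regardless of how often each occurs, which is one reason a macro average might be preferred on imbalanced data.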