CL Aug 4, 2020

NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer

Hwijeen Ahn, Jimin Sun, Chan Young Park, Jungyun Seo

arXiv:2008.01354v131.1999 citations Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses the detection of offensive content across languages for social media moderation. Its contribution is incremental, since it builds on existing methods such as mBERT.

The paper tackled offensive language detection in a multilingual setting by using data augmentation and cross-lingual transfer, achieving competitive results in Greek, Danish, and Turkish at OffensEval 2020.

This paper describes our approach to the task of identifying offensive language in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset resulted in performance improvements compared to the baseline trained solely with the manually-annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored for social media text along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.
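
The abstract mentions using semi-supervised labels "with different thresholds" but does not say how the thresholds are applied. The following is only a minimal sketch of one plausible reading: keep a weakly labeled example when its soft score is far from the decision boundary and the labeling models roughly agree. The field names (`avg_conf`, `conf_std`), the function `select_confident`, and all cutoff values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class WeakExample:
    text: str
    avg_conf: float  # assumed: mean "offensive" score from the labeling models
    conf_std: float  # assumed: spread of those scores (model disagreement)


def select_confident(examples, low=0.2, high=0.8, max_std=0.15):
    """Turn soft labels into hard labels, keeping only confident examples.

    The cutoffs are illustrative defaults, not values reported by the authors.
    """
    selected = []
    for ex in examples:
        if ex.conf_std > max_std:
            continue  # labeling models disagree too much; skip
        if ex.avg_conf >= high:
            selected.append((ex.text, 1))  # confidently offensive
        elif ex.avg_conf <= low:
            selected.append((ex.text, 0))  # confidently not offensive
        # scores between low and high are treated as ambiguous and dropped
    return selected


pool = [
    WeakExample("a", 0.95, 0.05),
    WeakExample("b", 0.50, 0.05),
    WeakExample("c", 0.10, 0.02),
    WeakExample("d", 0.90, 0.40),
]
augmented = select_confident(pool)
```

Tightening `low` and `high` trades augmentation volume for label quality, which is presumably why several thresholds would be compared against the baseline trained only on manually annotated data.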
