CL · Oct 7, 2020

Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive Language Identification using Pre-trained Language Models

Shuohuan Wang, Jiaxiang Liu, Xuan Ouyang, Yu Sun

arXiv:2010.03542v131.11002 citations

Originality: Incremental advance

AI Analysis

This work addresses the identification and classification of offensive content online, a problem central to content moderation on social media platforms. The advance is incremental, since the approach builds on existing pre-trained language models.

The paper tackles the detection and categorization of offensive language in social media across multiple languages. The team placed first in all three sub-tasks of SemEval-2020 Task 12, with the top average F1 score across languages in Sub-task A (Offensive Language Identification).

This paper describes Galileo's performance in SemEval-2020 Task 12 on detecting and categorizing offensive language in social media. For offensive language identification, we proposed a multi-lingual method using the pre-trained language models ERNIE and XLM-R. For offensive language categorization, we proposed a knowledge distillation method trained on soft labels generated by several supervised models. Our team participated in all three sub-tasks. In Sub-task A - Offensive Language Identification, we ranked first in terms of average F1 score across all languages. We were also the only team that ranked among the top three in every language. We also took first place in Sub-task B - Automatic Categorization of Offense Types and Sub-task C - Offense Target Identification.
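
The abstract mentions knowledge distillation on soft labels generated by several supervised models, but it does not say how those labels are combined or which loss is used. The sketch below is one common way such a setup can look, not the authors' implementation: it assumes the teachers' class probabilities are simply averaged, and that the student is trained with cross-entropy against the averaged distribution. The function names, the three-class setup, and all numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_labels(teacher_probs):
    """Average per-class probabilities from several teacher models.

    Assumption: plain averaging; the abstract does not specify the scheme.
    """
    n = len(teacher_probs)
    num_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n for c in range(num_classes)]

def soft_cross_entropy(student_logits, target_probs):
    """Cross-entropy between the student's prediction and a soft target."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))

# Hypothetical outputs of three teachers for one example with three classes.
teachers = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
]
targets = soft_labels(teachers)

# Hypothetical student logits for the same example.
loss = soft_cross_entropy([2.0, 0.5, -1.0], targets)
print(targets, loss)
```

Compared with training on hard labels, the soft target keeps the teachers' relative confidence across classes, so a student that matches the averaged distribution gets a lower loss than one that only gets the top class right.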
