CL, LG | May 1, 2020

Hitachi at SemEval-2020 Task 12: Offensive Language Identification with Noisy Labels using Statistical Sampling and Post-Processing

Manikandan Ravikiran, Amin Ekant Muljibhai, Toshinori Miyoshi, Hiroaki Ozaki, Yuta Koreeda, Sakata Masayuki

arXiv:2005.00295v1

Originality: Synthesis-oriented

AI Analysis

This work addresses offensive language identification in social media data with noisy labels. The contribution is incremental: it builds on existing methods such as BERT with minor modifications.

The paper tackles offensive language identification from noisy labels with a hybrid system: a BERT classifier trained on tweets chosen by statistical sampling, with post-processing based on an offensive wordlist. The system achieved a Macro-F1 score of 0.90913 and ranked 34th in the competition.

In this paper, we present our participation in SemEval-2020 Task-12 Subtask-A (English Language), which focuses on offensive language identification from noisy labels. To this end, we developed a hybrid system with a BERT classifier trained on tweets selected using a Statistical Sampling Algorithm (SA) and Post-Processed (PP) using an offensive wordlist. Our system achieved 34th position with a Macro-averaged F1-score (Macro-F1) of 0.90913 over both offensive and non-offensive classes. We further show comprehensive results and error analysis to assist future research in offensive language identification with noisy labels.
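For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so the offensive and non-offensive classes count equally regardless of how many tweets each contains. The following is a minimal sketch of that computation, not the authors' evaluation code; the label strings `OFF` and `NOT`, the function names, and the toy data are assumptions made for illustration.

```python
def f1_for_class(gold, pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=("OFF", "NOT")):
    """Unweighted mean of per-class F1 over the given labels."""
    return sum(f1_for_class(gold, pred, lab) for lab in labels) / len(labels)


# Toy data (illustrative only, not taken from the paper).
gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
print(f"Macro-F1: {score:.3f}")
```

Because each class contributes equally to the average, a system that labels every tweet as non-offensive scores poorly on Macro-F1 even when most tweets are in fact non-offensive.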
