Hitachi at SemEval-2020 Task 12: Offensive Language Identification with Noisy Labels using Statistical Sampling and Post-Processing
This work addresses the problem of identifying offensive language in social media data with noisy labels. The contribution is incremental, as it builds on existing methods such as BERT with minor modifications.
The paper tackled offensive language identification from noisy labels by developing a hybrid system using BERT with statistical sampling and post-processing, achieving a Macro-F1 score of 0.90913 and ranking 34th in the competition.
In this paper, we present our participation in SemEval-2020 Task-12 Subtask-A (English Language), which focuses on offensive language identification from noisy labels. To this end, we developed a hybrid system that combines a BERT classifier trained on tweets selected with a Statistical Sampling Algorithm (SA) and Post-Processing (PP) with an offensive wordlist. Our system ranked 34th with a Macro-averaged F1-score (Macro-F1) of 0.90913 over both the offensive and non-offensive classes. We further present comprehensive results and an error analysis to assist future research in offensive language identification with noisy labels.
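To make the evaluation metric and the post-processing idea concrete, the following is a minimal sketch and not the system's actual implementation. `macro_f1` computes the unweighted mean of the per-class F1 scores, which is the standard definition of Macro-F1. `post_process` is a hypothetical rule, assumed here only for illustration: it overrides a predicted label to offensive when a token of the tweet appears in the wordlist. The label names `OFF` and `NOT`, the function names, and the tokenization are all assumptions.

```python
def macro_f1(y_true, y_pred, labels=("OFF", "NOT")):
    """Unweighted mean of per-class F1 scores (Macro-F1)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def post_process(tweet, predicted_label, wordlist):
    """Hypothetical rule (an assumption, not the paper's stated method):
    force the label to OFF if any token is in the offensive wordlist."""
    tokens = (tok.strip(".,!?") for tok in tweet.lower().split())
    if any(tok in wordlist for tok in tokens):
        return "OFF"
    return predicted_label


if __name__ == "__main__":
    gold = ["OFF", "OFF", "NOT", "NOT"]
    pred = ["OFF", "NOT", "NOT", "NOT"]
    print(round(macro_f1(gold, pred), 4))
    print(post_process("You are an idiot!", "NOT", {"idiot"}))
```

Because Macro-F1 weights both classes equally, errors on the less frequent offensive class affect the score as much as errors on the non-offensive class.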