CL · Apr 4, 2021

IITK@Detox at SemEval-2021 Task 5: Semi-Supervised Learning and Dice Loss for Toxic Spans Detection

arXiv:2104.01566v1711 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses toxicity attribution in text for content moderation. Its contribution is incremental: it builds on existing transformer models with minor modifications.

The paper tackles toxic spans detection on a small, imbalanced dataset using semi-supervised learning and Self-Adjusting Dice Loss. The submitted system ranked ninth on the SemEval-2021 leaderboard.

In this work, we present our approach and findings for SemEval-2021 Task 5 - Toxic Spans Detection. The task's main aim was to identify the spans to which a given text's toxicity could be attributed. The task is challenging mainly due to two constraints: the small training dataset and the imbalanced class distribution. Our paper investigates two techniques for tackling these challenges: semi-supervised learning and learning with Self-Adjusting Dice Loss. Our submitted system (ranked ninth on the leaderboard) consisted of an ensemble of various pre-trained Transformer Language Models, each trained using one of these two techniques.
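The abstract names Self-Adjusting Dice Loss but does not state its form. As a rough illustration only, the sketch below assumes the commonly cited per-token formulation, DSC = (2(1 - p)^alpha * p * y + gamma) / ((1 - p)^alpha * p + y + gamma), with loss = 1 - DSC. The function name, the parameter names, and the default values of `alpha` and `gamma` are hypothetical and are not taken from the paper or its code.

```python
def self_adjusting_dice_loss(probs, labels, alpha=1.0, gamma=1.0):
    """Mean self-adjusting Dice loss over tokens (illustrative sketch).

    probs  -- predicted probability of the positive (toxic) class per token
    labels -- gold labels per token: 1 for toxic, 0 for non-toxic
    alpha  -- exponent on the (1 - p) factor (assumed default)
    gamma  -- smoothing constant in numerator and denominator (assumed default)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one token is required")

    total = 0.0
    for p, y in zip(probs, labels):
        # The (1 - p)^alpha factor shrinks the contribution of tokens
        # that the model already scores with high confidence.
        weighted = ((1.0 - p) ** alpha) * p
        dsc = (2.0 * weighted * y + gamma) / (weighted + y + gamma)
        total += 1.0 - dsc
    return total / len(probs)


if __name__ == "__main__":
    # Mostly non-toxic tokens with one toxic token, mimicking class imbalance.
    probs = [0.1, 0.2, 0.7, 0.05]
    labels = [0, 0, 1, 0]
    print(self_adjusting_dice_loss(probs, labels))
```

Under this formulation, a non-toxic token predicted with probability 0 contributes no loss at all, which is one way such a loss can keep the many easy negative tokens from dominating training on an imbalanced dataset.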

Code Implementations: 1 repo