CL · Sep 21, 2022

Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning

Oxford
arXiv:2209.10193v1 · 584 citations · h-index: 28
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient annotation in abusive language detection. It is an incremental advance that focuses on data efficiency rather than on effectiveness alone.

The paper tackles the problem of expensive and harmful annotation in abusive language detection by demonstrating that transformers-based active learning can achieve high effectiveness while requiring only a fraction of labeled data, especially when abusive content is a small percentage of the dataset.

Annotating abusive language is expensive, logistically complex and creates a risk of psychological harm. However, most machine learning research has prioritized maximizing effectiveness (i.e., F1 or accuracy score) rather than data efficiency (i.e., minimizing the amount of data that is annotated). In this paper, we use simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach to substantially raise efficiency whilst still maintaining high effectiveness, especially when abusive content is a smaller percentage of the dataset. This approach requires a fraction of labeled data to reach performance equivalent to training over the full dataset.
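The simulated set-up described in the abstract can be illustrated with a small pool-based active learning loop. This is only a sketch under stated assumptions, not the paper's implementation: the abstract does not name a query strategy or model, so the example assumes uncertainty sampling, and a toy one-feature threshold classifier stands in for a transformer. The function names (`make_pool`, `fit_threshold`, `active_learning`) and all parameter values are hypothetical.

```python
import random


def make_pool(n, abuse_pct, seed=0):
    """Build a synthetic pool of (feature, label) pairs with a chosen share of abuse."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n):
        label = 1 if rng.random() < abuse_pct else 0
        # Assumption: one noisy feature stands in for a transformer's representation.
        feature = rng.gauss(1.0 if label else -1.0, 1.0)
        pool.append((feature, label))
    return pool


def fit_threshold(labelled):
    """Toy classifier: a threshold halfway between the two class means."""
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    if not pos or not neg:
        return 0.0  # fallback until both classes have been labelled
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def f1(threshold, data):
    """F1 score for the abusive class, predicting abuse when feature > threshold."""
    tp = sum(1 for x, y in data if x > threshold and y == 1)
    fp = sum(1 for x, y in data if x > threshold and y == 0)
    fn = sum(1 for x, y in data if x <= threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def active_learning(pool, test, seed_size=20, batch=10, rounds=10, seed=0):
    """Simulate annotation: labels already exist and are revealed only when queried."""
    rng = random.Random(seed)
    unlabelled = list(pool)
    rng.shuffle(unlabelled)
    labelled, unlabelled = unlabelled[:seed_size], unlabelled[seed_size:]
    history = []
    for _ in range(rounds):
        threshold = fit_threshold(labelled)
        history.append((len(labelled), f1(threshold, test)))
        # Uncertainty sampling: query the items closest to the decision boundary.
        unlabelled.sort(key=lambda item: abs(item[0] - threshold))
        labelled += unlabelled[:batch]
        unlabelled = unlabelled[batch:]
    return history


if __name__ == "__main__":
    pool = make_pool(2000, abuse_pct=0.10, seed=1)
    test = make_pool(500, abuse_pct=0.10, seed=2)
    for n_labelled, score in active_learning(pool, test):
        print(n_labelled, round(score, 3))
```

Each entry in the returned history pairs the number of labelled items with the F1 score at that point, which is the efficiency-versus-effectiveness trade-off the abstract refers to. Changing `abuse_pct` mimics the varying percentages of abuse mentioned there.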

Code Implementations: 1 repo