SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification
This work addresses the need for larger and less biased datasets for offensive language detection in social media. The contribution is incremental, since it extends the existing OLID dataset.
The authors address the limited size and potential bias of existing data for offensive language identification by introducing SOLID, a large-scale semi-supervised dataset of over nine million English tweets. Combined with the existing OLID dataset, SOLID yields sizable performance gains on the OLID test set, especially at the lower levels of the taxonomy.
The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.