CXP949 at WNUT-2020 Task 2: Extracting Informative COVID-19 Tweets -- RoBERTa Ensembles and The Continued Relevance of Handcrafted Features
This work addresses the challenge of classifying noisy, domain-specific social media data for applications such as public health monitoring. Its contribution is incremental: it builds on existing methods and reports minor improvements.
The paper tackles the problem of extracting informative COVID-19 tweets from noisy user-generated text by augmenting a fine-tuned RoBERTa model with ensembling and handcrafted features, achieving a score within 2 points of the top-performing team.
This paper presents our submission to Task 2 of the Workshop on Noisy User-generated Text. We explore improving the performance of a pre-trained transformer-based language model fine-tuned for text classification through an ensemble implementation that makes use of corpus-level information and a handcrafted feature. We test whether including these features helps accommodate the challenges of a noisy data set centred on a specific subject outside the remit of the pre-training data. We show that including additional features can improve classification results, and we achieve a score within 2 points of the top-performing team.
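
The general idea of combining a classifier's output with an additional feature can be sketched as follows. This is a minimal illustration under stated assumptions, not the submitted system: the `base_prob` values are placeholder numbers standing in for the output of a fine-tuned model, the digit-based `handcrafted_feature` is a hypothetical example of such a feature, the tweets are invented toy data, and a small logistic-regression stacker is just one of several possible ways to build an ensemble.

```python
import math
import re


def handcrafted_feature(tweet: str) -> float:
    """Hypothetical handcrafted feature: 1.0 if the tweet contains a digit."""
    return 1.0 if re.search(r"\d", tweet) else 0.0


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability of the positive class; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_stacker(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grads[0] += err
            for i, v in enumerate(x):
                grads[i + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grads)]
    return weights


# Invented toy examples: (tweet, placeholder base-model probability, label).
# Label 1 stands for "informative", 0 for "uninformative".
examples = [
    ("12 new cases confirmed in the county today", 0.90, 1),
    ("Hospital reports 3 deaths since Monday", 0.55, 1),
    ("Officials say 40 people are now in isolation", 0.45, 1),
    ("Stay safe everyone, thinking of you all", 0.60, 0),
    ("So bored of lockdown already", 0.20, 0),
    ("Why is nobody talking about this", 0.50, 0),
]

# Each row pairs the base-model probability with the handcrafted feature.
rows = [[prob, handcrafted_feature(tweet)] for tweet, prob, _ in examples]
labels = [label for _, _, label in examples]

weights = fit_stacker(rows, labels)
for (tweet, _, label), x in zip(examples, rows):
    print(f"{predict(weights, x):.2f}  gold={label}  {tweet}")
```

In this toy data the placeholder probabilities alone would misclassify two examples at a 0.5 threshold, while the stacked model can use the extra feature to separate them; the example is constructed to show that effect and says nothing about the size of any gain on real data.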