LynyrdSkynyrd at WNUT-2020 Task 2: Semi-Supervised Learning for Identification of Informative COVID-19 English Tweets
This work addresses the challenge of filtering informative pandemic-related content from social media for public health applications, and represents an incremental improvement within a shared-task setting.
The authors tackle the problem of identifying informative COVID-19 English tweets with an ensemble system that combines traditional classifiers and pre-trained language models, using pseudo-labelling to exploit unlabelled data. The system achieves an F1-score of 0.9179 on the validation set and 0.8805 on the blind test set.
We describe our system for the WNUT-2020 shared task on the identification of informative COVID-19 English tweets. Our system is an ensemble of various machine learning methods, leveraging both traditional feature-based classifiers and recent advances in pre-trained language models that help capture syntactic, semantic, and contextual features of the tweets. We further employ pseudo-labelling to incorporate the unlabelled Twitter data released on the pandemic. Our best-performing model achieves an F1-score of 0.9179 on the provided validation set and 0.8805 on the blind test set.
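To make the idea of pseudo-labelling with an ensemble concrete, the following is a minimal sketch and not the system described above. It assumes soft voting (averaging each model's probability that a tweet is informative) and a confidence threshold of 0.9 for accepting a pseudo-label; neither choice is stated in the abstract. The names `ensemble_proba`, `pseudo_label`, `keyword_model`, and `digit_model` are hypothetical, and the two toy models stand in for real classifiers.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model maps a tweet to P(informative) in [0, 1].
Model = Callable[[str], float]


def ensemble_proba(models: Sequence[Model], tweet: str) -> float:
    """Soft voting: average the positive-class probability over all models."""
    return sum(model(tweet) for model in models) / len(models)


def pseudo_label(
    models: Sequence[Model],
    unlabelled: Sequence[str],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[Tuple[str, int]]:
    """Return (tweet, label) pairs only where the ensemble is confident.

    A tweet is labelled 1 if the averaged probability is at least `threshold`,
    0 if it is at most `1 - threshold`, and is skipped otherwise.
    """
    selected: List[Tuple[str, int]] = []
    for tweet in unlabelled:
        p = ensemble_proba(models, tweet)
        if p >= threshold:
            selected.append((tweet, 1))
        elif p <= 1.0 - threshold:
            selected.append((tweet, 0))
    return selected


# Toy stand-ins for trained classifiers (illustrative only).
def keyword_model(tweet: str) -> float:
    return 0.95 if "cases" in tweet.lower() else 0.05


def digit_model(tweet: str) -> float:
    return 0.9 if any(ch.isdigit() for ch in tweet) else 0.1


if __name__ == "__main__":
    pool = [
        "120 new cases confirmed in Ohio",
        "stay safe everyone",
        "so many cases lately",  # models disagree, so this tweet is skipped
    ]
    for tweet, label in pseudo_label([keyword_model, digit_model], pool):
        print(label, tweet)
```

In a full pipeline, the confidently labelled tweets would typically be appended to the labelled training set and the models retrained, possibly over several rounds.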