Identifying epidemic related Tweets using noisy learning
This work addresses the annotation bottleneck in supervised learning for public health monitoring, but its contribution is incremental: it applies existing noisy learning methods to a specific domain.
The authors tackled the problem of labor-intensive manual annotation for supervised learning by applying noisy learning theory to generate weak supervision signals for identifying epidemic-related tweets, achieving model performance greater than 90% on a large epidemic corpus.
Supervised learning algorithms rely heavily on annotated datasets to train machine learning models. However, curating annotated datasets is laborious and time-consuming because of the manual effort involved, and it has become a major bottleneck in supervised learning. In this work, we apply the theory of noisy learning to generate weak supervision signals in place of manual annotation. We curate a noisily labeled dataset using a labeling heuristic to identify epidemic-related tweets. We evaluate performance on a large epidemic corpus, and our results demonstrate that models trained on noisy data in a class-imbalanced, multi-class weak supervision setting achieve performance greater than 90%.
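To make the idea of a labeling heuristic concrete, the following is a minimal sketch of how noisy labels might be generated without manual annotation. It is not the heuristic used in this work, which the abstract does not specify. The keyword lists, the class names, the `weak_label` function, and the example tweets are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword lists per class (an assumption for illustration;
# the actual labeling heuristic is not described in the abstract).
KEYWORDS = {
    "ebola": ["ebola"],
    "cholera": ["cholera"],
    "covid": ["covid", "coronavirus"],
}
OTHER = "unrelated"  # hypothetical catch-all class


def weak_label(tweet: str) -> str:
    """Return a noisy label: the first class whose keyword appears as a token."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    for label, words in KEYWORDS.items():
        if any(word in tokens for word in words):
            return label
    return OTHER


# Made-up example tweets, not drawn from any real corpus.
tweets = [
    "#Ebola outbreak confirmed in the region",
    "Cholera cases rising after the floods",
    "New coronavirus guidance released today",
    "Great match last night!",
    "Traffic is terrible this morning",
]

noisy_labels = [weak_label(t) for t in tweets]

# Counting labels exposes the class imbalance that a downstream
# multi-class model would have to cope with.
label_counts = Counter(noisy_labels)
print(label_counts)
```

Labels produced this way are noisy by construction: a tweet can mention a disease keyword without being about an outbreak, and a relevant tweet can omit every keyword. A classifier trained on such labels is therefore learning under label noise rather than from a manually verified gold standard.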