CL LG SI ML

Mar 26, 2020

Integrating Crowdsourcing and Active Learning for Classification of Work-Life Events from Tweets

Yunpeng Zhao, Mattia Prosperi, Tianchen Lyu, Yi Guo, Jiang Bian

arXiv:2003.12139v2

0.52 citations

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive process of manual annotation for social media research, offering a practical solution for researchers in NLP and computational social science, but it is incremental as it combines existing methods without major breakthroughs.

The authors address the manual annotation burden of creating gold-standard datasets from social media by integrating crowdsourcing with active learning. In a case study on job loss tweets, crowdsourcing yielded high-quality annotations and active learning reduced the number of tweets needed, though the strategies tested performed similarly.

Social media, especially Twitter, is increasingly used for research with predictive analytics. In social media studies, natural language processing (NLP) techniques are used in conjunction with expert-based, manual, and qualitative analyses. However, social media data are unstructured and must undergo complex manipulation before they can be used for research. Manual annotation is the most resource- and time-consuming step, because multiple expert raters have to reach consensus on every item, but it is essential for creating gold-standard datasets to train NLP-based machine learning classifiers. To reduce the burden of manual annotation while maintaining its reliability, we devised a crowdsourcing pipeline combined with active learning strategies. We demonstrated its effectiveness through a case study that identifies job loss events from individual tweets. We used the Amazon Mechanical Turk platform to recruit annotators from the Internet and designed a number of quality control measures to assure annotation accuracy. We evaluated four active learning strategies (least confident, entropy, vote entropy, and Kullback-Leibler divergence), which aim to reduce the number of tweets needed to reach a desired performance of automated classification. Results show that crowdsourcing is useful for creating high-quality annotations and that active learning helps reduce the number of required tweets, although there was no substantial difference among the strategies tested.
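The four strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact formulations, so the code below uses the common textbook definitions of each score; all function names, including the select_batch helper, are hypothetical and chosen for illustration only. Each function returns a score where a higher value marks a tweet as more informative to annotate next.

```python
import math
from collections import Counter


def least_confident(probs):
    """Uncertainty = 1 - probability of the most likely class."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes):
    """Entropy of the hard labels voted by a committee of classifiers."""
    total = len(votes)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(votes).values())


def mean_kl_divergence(committee_probs):
    """Average KL divergence of each committee member from the consensus.

    committee_probs holds one class distribution per committee member;
    the consensus is the element-wise mean of those distributions.
    """
    size = len(committee_probs)
    n_classes = len(committee_probs[0])
    consensus = [sum(member[c] for member in committee_probs) / size
                 for c in range(n_classes)]
    total = 0.0
    for member in committee_probs:
        # Terms with p == 0 contribute nothing; q > 0 whenever p > 0.
        total += sum(p * math.log(p / q)
                     for p, q in zip(member, consensus) if p > 0)
    return total / size


def select_batch(scores, k):
    """Indices of the k highest-scoring (most informative) items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In a loop of this kind, the unlabeled tweets would be scored with one of the functions, the top-ranked batch sent to annotators, and the classifier retrained on the enlarged labeled set. The first two scores need only a single probabilistic classifier, while the last two assume a committee of classifiers.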
