CL · Apr 25, 2023

Lessons Learned from a Citizen Science Project for Natural Language Processing

arXiv:2304.12836v1 · 269 citations · h-index: 81
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of costly, difficult-to-scale data annotation for NLP researchers. Its contribution is incremental, since it builds on existing crowdsourcing methods.

The study explores citizen science as an alternative to paid crowdsourcing for NLP annotation tasks. It finds that citizen science can produce high-quality annotations and attract motivated volunteers, but that projects must account for scalability, participation over time, and legal and ethical issues.

Many Natural Language Processing (NLP) systems use annotated corpora for training and evaluation. However, labeled data is often costly to obtain and scaling annotation projects is difficult, which is why annotation tasks are often outsourced to paid crowdworkers. Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP. To investigate whether and how well Citizen Science can be applied in this setting, we conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset. Our results show that this can yield high-quality annotations and attract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues. We summarize lessons learned in the form of guidelines and provide our code and data to aid future work on Citizen Science.

Code Implementations: 1 repo
