Toward Effective Automated Content Analysis via Crowdsourcing
This work addresses the challenge researchers face in collecting high-quality subjective data at scale. The contribution is incremental, since it builds on existing crowdsourcing methods.
The paper tackles the deterioration of quality in crowdsourced semantic annotation by proposing a quality-aware system that gives workers timely feedback. With this feedback, workers maintain their labeling quality over time, and the collected labels support machine learning tasks that reach 70%-80% accuracy.
Many computer scientists use the aggregated answers of online workers to represent ground truth. Prior work has shown that aggregation methods such as majority voting are effective for measuring relatively objective features. For subjective features such as semantic connotation, however, online workers, who are known to optimize their hourly earnings, tend to give lower-quality responses the longer they work. In this paper, we address this issue by proposing a quality-aware semantic data annotation system. We observe that, given timely feedback on their performance in the form of quality scores, better-informed online workers can maintain the quality of their labeling over an extended period of time. We validate the effectiveness of the proposed annotation system by i) evaluating its performance against an expert-labeled dataset, and ii) demonstrating machine learning tasks that exhibit consistent learning behavior with 70%-80% accuracy. Our results suggest that, with our system, researchers can collect high-quality answers for subjective semantic features at a large scale.
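To make the two ingredients mentioned above concrete, the following is a minimal sketch of majority-vote aggregation and of a per-worker quality score used to trigger feedback. The abstract does not define how quality scores are computed, so the score here (the fraction of a worker's labels that agree with the majority) is an assumption chosen for illustration, as are the function names, the sample data, and the 0.7 feedback threshold.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def aggregate(annotations):
    """Map each item to its majority label.

    annotations: iterable of (worker_id, item_id, label) tuples.
    """
    by_item = defaultdict(list)
    for _worker, item, label in annotations:
        by_item[item].append(label)
    return {item: majority_vote(labels) for item, labels in by_item.items()}


def quality_scores(annotations, consensus):
    """Assumed score: share of a worker's labels matching the consensus."""
    agree, total = Counter(), Counter()
    for worker, item, label in annotations:
        total[worker] += 1
        if label == consensus[item]:
            agree[worker] += 1
    return {worker: agree[worker] / total[worker] for worker in total}


def feedback(score, threshold=0.7):
    """Hypothetical feedback message; the threshold is illustrative only."""
    if score >= threshold:
        return f"Quality score {score:.2f}: keep up the careful work."
    return f"Quality score {score:.2f}: please slow down and re-read each item."


# Illustrative data, not taken from the paper.
annotations = [
    ("w1", "t1", "positive"), ("w2", "t1", "positive"), ("w3", "t1", "negative"),
    ("w1", "t2", "negative"), ("w2", "t2", "negative"), ("w3", "t2", "negative"),
    ("w1", "t3", "neutral"), ("w2", "t3", "positive"), ("w3", "t3", "positive"),
]

consensus = aggregate(annotations)
scores = quality_scores(annotations, consensus)
for worker, score in sorted(scores.items()):
    print(worker, feedback(score))
```

Scoring against the majority is only reasonable when most workers are reliable. Where expert labels exist, as in the evaluation described above, the same `quality_scores` function could be given those labels in place of `consensus`.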