CL HC
Jan 9, 2017

Crowdsourcing Ground Truth for Medical Relation Extraction

Anca Dumitrache, Lora Aroyo, Chris Welty

arXiv:1701.02185v2

Originality: Incremental advance

AI Analysis

This paper addresses the need for scalable and cost-effective ground truth data in medical NLP. The advance is incremental because it builds on existing crowdsourcing approaches.

The paper tackles the problem of gathering high-quality labeled data for medical relation extraction. It proposes the CrowdTruth method, which treats annotator disagreement as a signal of ambiguity in the text. The resulting data matches expert quality at lower cost and outperforms distant supervision as training data.

Cognitive computing systems require human labeled data for evaluation, and often for training. The standard practice used in gathering this data minimizes disagreement between annotators, and we have found this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, that reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the $cause$ and $treat$ relations, and how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure, that account for ambiguity in both human and machine performance on this task.
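The abstract does not spell out how disagreement is turned into a score or how the weighted measures are computed, so the following is only a minimal sketch under stated assumptions. It assumes each sentence carries a count of crowd votes per relation, scores a relation by the cosine between that vote vector and the relation's unit vector, and weights each example's contribution to precision, recall, and F-measure by a per-example clarity weight. The function names `sentence_relation_score` and `weighted_prf`, the data layout, and the weighting scheme are all hypothetical illustrations, not the authors' definitions.

```python
from math import sqrt


def sentence_relation_score(votes, relation):
    """Cosine between a sentence's vote vector and the unit vector of one relation.

    `votes` maps relation name -> number of workers who picked it (assumed layout).
    A score near 1 means workers agreed on `relation`; a low score means disagreement.
    """
    norm = sqrt(sum(count * count for count in votes.values()))
    if norm == 0:
        return 0.0
    return votes.get(relation, 0) / norm


def weighted_prf(examples):
    """Weighted precision, recall, and F-measure.

    `examples` is a list of (predicted, gold, weight) tuples, where `weight`
    is an assumed per-example clarity weight in [0, 1]. Ambiguous examples
    count for less in every tally than clear ones.
    """
    tp = sum(w for pred, gold, w in examples if pred and gold)
    fp = sum(w for pred, gold, w in examples if pred and not gold)
    fn = sum(w for pred, gold, w in examples if not pred and gold)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical votes: 3 workers chose "cause", 4 chose "treat".
    votes = {"cause": 3, "treat": 4}
    print(sentence_relation_score(votes, "treat"))  # 0.8

    # Hypothetical predictions: one clear hit, one ambiguous false positive,
    # one ambiguous miss.
    examples = [(True, True, 1.0), (True, False, 0.5), (False, True, 0.5)]
    print(weighted_prf(examples))
```

With unit weights, `weighted_prf` reduces to the usual unweighted precision, recall, and F-measure, which makes the effect of the weighting easy to compare against standard evaluation.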
