Towards Accurate and Consistent Evaluation: A Dataset for Distantly-Supervised Relation Extraction
This work addresses evaluation inconsistencies for researchers in relation extraction, though the contribution is incremental because it builds on existing datasets.
The paper tackles the problem of inaccurate evaluation in distantly-supervised relation extraction due to wrong labels in automatically generated datasets, by building NYT-H, a dataset with human-annotated test data, and shows that system rankings differ between DS-labelled and human-annotated tests.
In recent years, distantly-supervised relation extraction has achieved some success by using deep neural networks. Distant Supervision (DS) can automatically generate large-scale annotated data by aligning entity pairs from Knowledge Bases (KB) to sentences. However, these DS-generated datasets inevitably contain wrong labels, which lead to incorrect evaluation scores during testing and may mislead researchers. To solve this problem, we build a new dataset, NYT-H, where we use the DS-generated data as training data and hire annotators to label the test data. Compared with previous datasets, NYT-H has a much larger test set, which enables more accurate and consistent evaluation. Finally, we present the experimental results of several widely used systems on NYT-H. The results show that the rankings of the compared systems differ between the DS-labelled test data and the human-annotated test data. This indicates that our human-annotated data is necessary for the evaluation of distantly-supervised relation extraction.
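To make the source of the wrong labels concrete, the following is a minimal sketch of distant-supervision labelling under simplifying assumptions. It is not the procedure used to build NYT-H: the toy knowledge base, the example sentences, the `Triple` type, and the `distant_label` function are all hypothetical, and entity matching is reduced to plain substring search.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A knowledge-base fact (hypothetical toy schema)."""
    head: str
    relation: str
    tail: str


# Hypothetical toy knowledge base with a single fact.
KB = [Triple("Barack Obama", "place_of_birth", "Honolulu")]

# Hypothetical sentences; only the first actually expresses the relation.
SENTENCES = [
    "Barack Obama was born in Honolulu.",
    "Barack Obama visited Honolulu last week.",
    "Honolulu is the capital of Hawaii.",
]


def distant_label(
    sentences: List[str], kb: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Label a sentence with a KB relation whenever both entities appear in it.

    This is the core DS assumption: co-occurrence of an entity pair implies
    the KB relation. Matching here is naive substring search.
    """
    labelled = []
    for sentence in sentences:
        for triple in kb:
            if triple.head in sentence and triple.tail in sentence:
                labelled.append(
                    (sentence, triple.head, triple.tail, triple.relation)
                )
    return labelled


ds_labels = distant_label(SENTENCES, KB)
for sentence, head, tail, relation in ds_labels:
    print(f"{relation}({head}, {tail}) <- {sentence}")
```

In this toy example, the second sentence receives the label `place_of_birth` even though it only describes a visit. Noise of this kind in a DS-labelled test set is what human annotation of the test data is meant to remove.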