Haitao Wang

h-index 12

4 papers

1,014 citations

Novelty 16%

AI Score 27

Ranked #156,475 of 194,257 authors (top 81%) · #27,169 in CL (top 88%)

4 Papers

0.8 · CL · Dec 19, 2022

Statistical Dataset Evaluation: Reliability, Difficulty, and Validity

Chengwen Wang, Qingxiu Dong, Xiaochen Wang et al. · pku

Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation. We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity, following classical testing theory. Taking Named Entity Recognition (NER) datasets as a case study, we introduce 9 statistical metrics for a statistical dataset evaluation framework. Experimental results and human evaluation validate that our evaluation framework effectively assesses various aspects of dataset quality. Furthermore, we study how the dataset scores on our statistical metrics affect model performance, and appeal for dataset quality evaluation or targeted dataset improvement before training or testing models.

31.0 · CL · Oct 30, 2020 · Code

Towards Accurate and Consistent Evaluation: A Dataset for Distantly-Supervised Relation Extraction

Tong Zhu, Haitao Wang, Junjie Yu et al.

In recent years, distantly-supervised relation extraction has achieved some success by using deep neural networks. Distant Supervision (DS) can automatically generate large-scale annotated data by aligning entity pairs from Knowledge Bases (KB) to sentences. However, these DS-generated datasets inevitably contain wrong labels that result in incorrect evaluation scores during testing, which may mislead researchers. To solve this problem, we build a new dataset, NYT-H, in which we use the DS-generated data as training data and hire annotators to label the test data. Compared with previous datasets, NYT-H has a much larger test set, which allows more accurate and consistent evaluation. Finally, we present the experimental results of several widely used systems on NYT-H. The results show that the rankings of the compared systems differ between the DS-labelled test data and the human-annotated test data. This indicates that our human-annotated data is necessary for the evaluation of distantly-supervised relation extraction.

0.2 · CL · Aug 29, 2019 · Code

CCKS 2019 Shared Task on Inter-Personal Relationship Extraction

Haitao Wang, Zhengqiu He, Tong Zhu et al.

The CCKS2019 shared task was devoted to inter-personal relationship extraction. Given two person entities and at least one sentence containing these two entities, participating teams are asked to predict the relationship between the entities according to a given relation list. This year, 358 teams from various universities and organizations participated in this task. In this paper, we present the task definition, the description of data and the evaluation methodology used during this shared task. We also present a brief overview of the various methods adopted by the participating teams. Finally, we present the evaluation results.

1.3 · CL · Jul 30, 2019 · Code

IPRE: a Dataset for Inter-Personal Relationship Extraction

Haitao Wang, Zhengqiu He, Jin Ma et al.

Inter-personal relationships are the basis of human society. In order to automatically identify the relations between persons from texts, we need annotated data for training systems. However, large-scale data of this kind has so far been lacking. To address this situation, we introduce IPRE, a new dataset for inter-personal relationship extraction which aims to facilitate information extraction and knowledge graph construction research. In total, IPRE has over 41,000 labeled sentences for 34 types of relations, including about 9,000 sentences annotated by workers. IPRE is the first dataset for inter-personal relationship extraction. Additionally, we define three evaluation tasks based on IPRE and provide baseline systems for further comparison in future work.