Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!
This meta-analysis addresses problems of methodological rigor in relation extraction research; its contribution is incremental, since it builds on existing distinctions between evaluation setups.
The paper identifies patterns of invalid performance comparisons in end-to-end relation extraction research and estimates that the most common mistake leads to overestimating performance by about 5% on the ACE05 dataset.
Despite efforts to distinguish three different evaluation setups (Bekoulis et al., 2018), numerous end-to-end Relation Extraction (RE) articles present unreliable performance comparisons to previous work. In this paper, we first identify several patterns of invalid comparisons in published papers and describe them to avoid their propagation. We then conduct a small empirical study to quantify the impact of the most common mistake and estimate that it leads to overestimating the final RE performance by around 5% on ACE05. We also seize this opportunity to study the unexplored ablations of two recent developments: the use of language model pretraining (specifically BERT) and span-level NER. This meta-analysis emphasizes the need for rigor in reporting both the evaluation setting and the dataset statistics, and we call for unifying the evaluation setting in end-to-end RE.
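To illustrate why scores obtained under different evaluation setups cannot be compared directly, here is a minimal sketch. The abstract does not spell out the three setups; the sketch assumes two of them are the commonly used "Strict" criterion (argument boundaries, argument entity types, and relation type must all match) and "Boundaries" criterion (entity types are ignored). The names `Entity`, `Relation`, `relation_key`, and `relation_f1` are hypothetical, the toy data is made up, and this is not the authors' evaluation script.

```python
from typing import Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    start: int   # index of the first token of the span
    end: int     # index of the last token of the span (inclusive)
    etype: str   # entity type, e.g. "PER"


class Relation(NamedTuple):
    head: Entity
    tail: Entity
    rtype: str   # relation type, e.g. "ORG-AFF"


def relation_key(rel: Relation, strict: bool) -> Tuple:
    """Return what must match for a predicted relation to count as correct."""
    if strict:
        # Assumed "Strict" criterion: boundaries, entity types, relation type.
        return (rel.head, rel.tail, rel.rtype)
    # Assumed "Boundaries" criterion: entity types are dropped from the key.
    return ((rel.head.start, rel.head.end),
            (rel.tail.start, rel.tail.end),
            rel.rtype)


def relation_f1(gold: Iterable[Relation], pred: Iterable[Relation],
                strict: bool) -> float:
    """Micro F1 over relations under the chosen matching criterion."""
    gold_keys = {relation_key(r, strict) for r in gold}
    pred_keys = {relation_key(r, strict) for r in pred}
    true_positives = len(gold_keys & pred_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): the second prediction has correct boundaries and
# relation type, but the head entity is typed "ORG" instead of "PER".
gold = [
    Relation(Entity(0, 1, "PER"), Entity(4, 5, "ORG"), "ORG-AFF"),
    Relation(Entity(7, 7, "PER"), Entity(9, 10, "GPE"), "PHYS"),
]
pred = [
    Relation(Entity(0, 1, "PER"), Entity(4, 5, "ORG"), "ORG-AFF"),
    Relation(Entity(7, 7, "ORG"), Entity(9, 10, "GPE"), "PHYS"),
]

strict_f1 = relation_f1(gold, pred, strict=True)
boundaries_f1 = relation_f1(gold, pred, strict=False)
```

On this toy input the same predictions score higher under the more lenient criterion than under the strict one, so a number reported under one setup says little about how a system ranks against numbers reported under the other.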