About the Evaluation of the F1 Score for the RECENT Relation Extraction System
This work addresses the accuracy and reliability of evaluation metrics for relation extraction systems. Its contribution is incremental, since it focuses on correcting errors in the results reported for one specific system.
The F1 score of the RECENT relation extraction system is re-examined here. The system originally claimed a state-of-the-art result of 75.2 on the TACRED dataset, but after the errors were corrected and the system was re-evaluated, the score dropped to 65.16.
This document discusses the F1 score evaluation used in the article 'Relation Classification with Entity Type Restriction' by Shengfei Lyu and Huanhuan Chen, published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. The authors of that article built a system named RECENT and claimed that it achieved a new (at the time) state-of-the-art F1 score of 75.2 on the TACRED dataset, above the previous best of 74.8. After the evaluation errors are corrected and the system is re-evaluated, the final result is 65.16.
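For context, results on TACRED are conventionally reported as micro-averaged precision, recall and F1 over the positive relation labels, with the negative class `no_relation` excluded from both counts; a score such as 65.16 is that F1 expressed as a percentage. The sketch below shows this computation under that assumption. The function name `tacred_micro_f1` and the sample labels are hypothetical, and the code is not claimed to be the exact scorer used in the RECENT paper or in the re-evaluation.

```python
from typing import Sequence, Tuple

NO_RELATION = "no_relation"  # TACRED's negative label (assumed convention)


def tacred_micro_f1(gold: Sequence[str], pred: Sequence[str],
                    negative_label: str = NO_RELATION) -> Tuple[float, float, float]:
    """Hypothetical helper: micro-averaged P/R/F1 that ignore the negative label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    correct = 0   # positive predictions that match the gold label
    guessed = 0   # all positive (non-negative) predictions
    gold_pos = 0  # all positive gold labels
    for g, p in zip(gold, pred):
        if p != negative_label:
            guessed += 1
            if p == g:
                correct += 1
        if g != negative_label:
            gold_pos += 1
    precision = correct / guessed if guessed else 0.0
    recall = correct / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["per:title", "no_relation", "org:founded_by", "per:age"]
    pred = ["per:title", "per:age", "no_relation", "per:age"]
    p, r, f = tacred_micro_f1(gold, pred)
    print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")  # P=0.6667 R=0.6667 F1=0.6667
```

Because correct `no_relation` predictions earn no credit, a scorer that accidentally counted them, or that scored against a different label set, could report a noticeably different number for the same predictions. This illustrates why a correction to the evaluation alone can move a reported F1 by several points.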