On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods
This work addresses evaluation ambiguity in knowledge graph enrichment tasks. The contribution is incremental: it refines existing evaluation practice rather than introducing new methods.
The paper identifies shortcomings in rank-based evaluation metrics for entity alignment and link prediction methods, showing that existing scores are not comparable across datasets and are sensitive to test set size, leading to potentially misleading conclusions. It proposes adjustments to enable fair, comparable, and interpretable performance assessment.
In this work, we take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment. In the current experimental setting, multiple different scores are employed to assess different aspects of model performance. We analyze the informativeness of these evaluation measures and identify several shortcomings. In particular, we demonstrate that none of the existing scores is well suited to comparing results across different datasets. Moreover, we demonstrate that, for the Entity Alignment task, varying the size of the test set automatically affects the performance of the same model as measured by commonly used metrics. We show that this leads to various problems in the interpretation of results, which may support misleading conclusions. Therefore, we propose adjustments to the evaluation and demonstrate empirically how they support a fair, comparable, and interpretable assessment of model performance. Our code is available at https://github.com/mberr/rank-based-evaluation.
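The dependence on test set size can be illustrated with a minimal sketch. It assumes that each test entity is ranked among `num_candidates` candidates (for Entity Alignment, taken here to be the test entities themselves) and uses a hypothetical model with no skill, whose ranks are spread uniformly over all positions. The function names and the chance-based normalization are illustrative assumptions, not necessarily the adjustment proposed in the paper or the API of the linked repository.

```python
from fractions import Fraction


def mean_rank(ranks):
    """Arithmetic mean of the ranks (lower is better)."""
    return Fraction(sum(ranks), len(ranks))


def hits_at_k(ranks, k):
    """Fraction of test instances whose true entity is ranked in the top k."""
    return Fraction(sum(1 for r in ranks if r <= k), len(ranks))


def expected_random_mean_rank(num_candidates):
    """Expected rank under uniformly random scoring: (n + 1) / 2."""
    return Fraction(num_candidates + 1, 2)


def chance_normalized_mean_rank(ranks, num_candidates):
    """Mean rank divided by its expectation under random scoring.

    Illustrative normalization (an assumption of this sketch):
    a value of 1 means "no better than chance" for any candidate set size.
    """
    return mean_rank(ranks) / expected_random_mean_rank(num_candidates)


def uniform_ranks(num_candidates):
    """Ranks of a hypothetical no-skill model: every rank 1..n occurs once."""
    return list(range(1, num_candidates + 1))


for n in (100, 1000):
    ranks = uniform_ranks(n)
    print(
        f"n={n}: "
        f"MR={float(mean_rank(ranks))}, "
        f"Hits@10={float(hits_at_k(ranks, 10))}, "
        f"normalized MR={float(chance_normalized_mean_rank(ranks, n))}"
    )
```

In this sketch the same no-skill model obtains a mean rank of 50.5 and Hits@10 of 0.1 with 100 candidates, but 500.5 and 0.01 with 1000 candidates, whereas the chance-normalized value stays at 1.0. Raw scores from differently sized test sets are therefore not directly comparable.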