LG AI | Apr 11, 2025

On Large-scale Evaluation of Embedding Models for Knowledge Graph Completion

Nasim Shirvani-Mahdavi, Farahnaz Akrami, Chengkai Li

arXiv:2504.08970v27.13 citations | h-index: 6

Originality: Incremental advance

AI Analysis

This paper addresses the problem of unreliable evaluation facing researchers and practitioners in knowledge graph completion. The advance is incremental because it builds on existing evaluation protocols.

The paper tackles the problem of unrealistic benchmarks and flawed evaluation metrics in knowledge graph embedding models for knowledge graph completion, revealing substantial performance variations between small and large datasets and systematic overestimation of model capabilities when n-ary relations are binarized.

Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion, yet their evaluation remains constrained by unrealistic benchmarks. Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples, contradicting the fundamental goals of link prediction. These metrics often compress accuracy assessment into a single value, obscuring models' specific strengths and weaknesses. The prevailing evaluation protocol, link prediction, operates under the unrealistic assumption that an entity's properties, for which values are to be predicted, are known in advance. While alternative protocols such as property prediction, entity-pair ranking, and triple classification address some of these limitations, they remain underutilized. Moreover, commonly used datasets are either faulty or too small to reflect real-world data. Few studies examine the role of mediator nodes, which are essential for modeling n-ary relationships, or investigate model performance variation across domains. This paper conducts a comprehensive evaluation of four representative KGE models on large-scale datasets FB-CVT-REV and FB+CVT-REV. Our analysis reveals critical insights, including substantial performance variations between small and large datasets, both in relative rankings and absolute metrics, systematic overestimation of model capabilities when n-ary relations are binarized, and fundamental limitations in current evaluation protocols and metrics.
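
The closed-world penalty described in the abstract can be illustrated with a toy ranking computation. The sketch below is not the paper's code: the function names (`rank_of_target`, `mrr`, `hits_at_k`), the candidate entities, and the scores are all hypothetical. It assumes a filtered ranking setup in which only answers already present in the graph are excluded from the ranking.

```python
from typing import Dict, List, Set


def rank_of_target(scores: Dict[str, float], target: str, known_true: Set[str]) -> int:
    """Filtered rank of `target` among candidate entities.

    Candidates listed in `known_true` (other answers already in the graph)
    are skipped. Every remaining candidate that outscores the target counts
    as an error, even if it is a true fact missing from the graph.
    """
    target_score = scores[target]
    better = sum(
        1
        for entity, score in scores.items()
        if entity != target and entity not in known_true and score > target_score
    )
    return better + 1


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank over a list of ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of ranks that are at most k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Hypothetical model scores for one query (head, relation, ?).
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}

# "c" is the held-out test answer; "a" is another answer already in the graph.
# Assume "b" is also correct in reality but absent from the graph.
rank_closed_world = rank_of_target(scores, "c", known_true={"a"})
rank_if_b_known = rank_of_target(scores, "c", known_true={"a", "b"})

print(rank_closed_world, mrr([rank_closed_world]), hits_at_k([rank_closed_world], 1))
print(rank_if_b_known, mrr([rank_if_b_known]), hits_at_k([rank_if_b_known], 1))
```

In this toy case the model ranks the missing-but-correct entity `b` above the test answer, so the closed-world evaluation lowers its score. Averaging such ranks into a single MRR value also hides which queries or relations caused the drop.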
