COMBO: A Complete Benchmark for Open KG Canonicalization
This work addresses redundancy and ambiguity in open knowledge graphs, a problem relevant to researchers in natural language processing and knowledge representation. Its contribution is incremental: it builds on existing datasets by adding new annotations and metrics.
The authors tackle open knowledge graph canonicalization by introducing COMBO, a benchmark that provides gold canonicalization for relation phrases and gold ontology-level canonicalization for noun phrases, along with source sentences and evaluation metrics. They find that encoding phrases with pretrained language models improves relation canonicalization and ontology-level canonicalization compared with previous methods.
An open knowledge graph (KG) consists of (subject, relation, object) triples extracted from large volumes of raw text. The subject and object noun phrases and the relation phrases in an open KG are highly redundant and ambiguous, so they need to be canonicalized. Existing datasets for open KG canonicalization provide only gold entity-level canonicalization for noun phrases. In this paper, we present COMBO, a Complete Benchmark for Open KG canonicalization. Compared with existing datasets, we additionally provide gold canonicalization for relation phrases, gold ontology-level canonicalization for noun phrases, and the source sentences from which the triples are extracted. We also propose metrics for evaluating each type of canonicalization. On the COMBO dataset, we empirically compare previously proposed canonicalization methods with a few simple baseline methods based on pretrained language models. We find that properly encoding the phrases in a triple using pretrained language models results in better relation canonicalization and better ontology-level canonicalization of noun phrases. We release our dataset, baselines, and evaluation scripts at https://github.com/jeffchy/COMBO/tree/main.
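To make the task concrete, canonicalization can be viewed as clustering phrases that refer to the same thing and comparing the predicted clusters against gold clusters. The sketch below is only an illustration under stated assumptions: the pairwise precision/recall/F1 score is a generic clustering measure and is not claimed to be one of the metrics proposed for COMBO, and the function names and toy phrases are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Return the set of unordered phrase pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so each unordered pair has a single canonical representation.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall, and F1 of predicted clusters against gold.

    Assumption: a generic clustering measure used for illustration only,
    not necessarily a metric defined by the benchmark.
    """
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    hits = len(pred_pairs & gold_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: noun phrases grouped by the entity they denote.
gold = [{"NYC", "New York City", "the Big Apple"}, {"Boston"}]
predicted = [{"NYC", "New York City"}, {"the Big Apple", "Boston"}]

precision, recall, f1 = pairwise_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the prediction merges one correct pair and one incorrect pair, and misses two of the three gold pairs. The same pair-based comparison could be applied to relation phrases by swapping in relation clusters.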