CL | Oct 22, 2024

Data-driven Coreference-based Ontology Building

arXiv:2410.17051v1 | 24 citations | h-index: 45 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work provides a scalable, data-driven method for ontology construction in the biomedical domain, which could aid in knowledge organization and retrieval.

The paper tackles the problem of building a domain ontology by analyzing coreference relations across a large corpus of 30 million biomedical abstracts, resulting in a data-driven ontology, parts of which overlap significantly with human-authored ones.

While coreference resolution is traditionally used as a component in individual document understanding, in this work we take a more global view and explore what we can learn about a domain from the set of all document-level coreference relations that are present in a large corpus. We derive coreference chains from a corpus of 30 million biomedical abstracts and construct a graph based on the string phrases within these chains, establishing connections between phrases if they co-occur within the same coreference chain. We then use the graph structure and the betweenness centrality measure to distinguish between edges denoting hierarchy, identity and noise, assign directionality to edges denoting hierarchy, and split nodes (strings) that correspond to multiple distinct concepts. The result is a rich, data-driven ontology over concepts in the biomedical domain, parts of which overlap significantly with human-authored ontologies. We release the coreference chains and resulting ontology under a Creative Commons license, along with the code.
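The graph construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code. The toy chains, the lowercase normalisation, and the function names `build_phrase_graph` and `betweenness_centrality` are assumptions made for illustration. The abstract does not say whether betweenness is computed over nodes or edges, or how scores map to the hierarchy, identity and noise labels, so the sketch only computes unnormalised node betweenness (Brandes' algorithm) on an unweighted graph.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_phrase_graph(chains):
    """Link two phrases if they co-occur in at least one coreference chain.

    Returns an adjacency map: phrase -> {neighbour: number of shared chains}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for chain in chains:
        # Assumption: phrases are matched after stripping and lowercasing.
        phrases = sorted({p.strip().lower() for p in chain if p.strip()})
        for p in phrases:
            _ = graph[p]  # register the node even if the chain has one phrase
        for a, b in combinations(phrases, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return {node: dict(nbrs) for node, nbrs in graph.items()}


def betweenness_centrality(graph):
    """Unnormalised node betweenness for an undirected, unweighted graph."""
    centrality = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical coreference chains, invented for this example.
chains = [
    ["the drug", "Aspirin"],
    ["the drug", "ibuprofen"],
    ["the drug", "warfarin"],
    ["aspirin", "acetylsalicylic acid"],
]
graph = build_phrase_graph(chains)
scores = betweenness_centrality(graph)
```

In this toy graph the generic phrase "the drug" bridges several otherwise unconnected drug names, so it receives the highest score. How such scores are turned into edge labels and directions is left to the paper and its released code.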

Code Implementations: 1 repo
