CL · Jun 26, 2023

Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement

Oxford
arXiv:2306.14704v3 · 11 citations · h-index: 91
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better automated concept discovery and placement in biomedical ontologies. Its contribution is incremental, since it builds on existing datasets and methods.

The authors tackle the problem of automatically harvesting new concepts from texts and placing them into knowledge bases. They create a new biomedical dataset that addresses limitations of existing datasets, such as the lack of support for out-of-KB mention discovery and the lack of context around a concept label. The result is a benchmark for evaluation, adapted from MedMentions with SNOMED CT.

Mentions of new concepts appear regularly in texts and require automated approaches to harvest and place them into Knowledge Bases (KBs), e.g., ontologies and taxonomies. Existing datasets suffer from three issues: (i) they mostly assume that a new concept is pre-discovered and so cannot support out-of-KB mention discovery; (ii) they use only the concept label as input along with the KB, and thus lack the contexts of a concept label; and (iii) they mostly focus on concept placement w.r.t. a taxonomy of atomic concepts, instead of complex concepts, i.e., concepts with logical operators. To address these issues, we propose a new benchmark, adapting the MedMentions dataset (PubMed abstracts) with SNOMED CT versions from 2014 and 2017 under the Diseases sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product. We show how to use the dataset to evaluate out-of-KB mention discovery and concept placement, adapting recent Large Language Model based methods.
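As a rough illustration of the two tasks named in the abstract, the toy sketch below assumes that an out-of-KB mention is one whose gold concept is absent from the older KB version but present in the newer one, and that placement means finding a parent-child edge in the older KB into which the new concept is inserted. These assumptions, the concept IDs, the data layout, and the function names are all hypothetical; none of them is taken from the dataset or its released code.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy KB snapshots: concept ID -> set of direct parent IDs.
# All IDs are invented for illustration; real SNOMED CT uses numeric identifiers.
KB_OLD: Dict[str, Set[str]] = {
    "disease": set(),
    "heart_disease": {"disease"},
    "myocarditis": {"heart_disease"},
}
KB_NEW: Dict[str, Set[str]] = {
    "disease": set(),
    "heart_disease": {"disease"},
    "inflammatory_heart_disease": {"heart_disease"},
    "myocarditis": {"inflammatory_heart_disease"},
    "pericarditis": {"heart_disease"},
}

# Each mention is (context text, gold concept ID in the newer KB).
MENTIONS: List[Tuple[str, str]] = [
    ("signs of inflammatory heart disease were seen", "inflammatory_heart_disease"),
    ("acute myocarditis after infection", "myocarditis"),
    ("recurrent pericarditis was reported", "pericarditis"),
]


def out_of_kb_mentions(
    mentions: List[Tuple[str, str]], kb_old: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Keep mentions whose gold concept does not exist in the older KB."""
    return [(ctx, cid) for ctx, cid in mentions if cid not in kb_old]


def gold_edges(
    cid: str, kb_old: Dict[str, Set[str]], kb_new: Dict[str, Set[str]]
) -> List[Tuple[str, Optional[str]]]:
    """Return (parent, child) edges of the older KB where `cid` would be placed.

    A child of None marks a leaf placement (the new concept has no children
    that already exist in the older KB).
    """
    parents = sorted(p for p in kb_new[cid] if p in kb_old)
    children = sorted(
        c for c, ps in kb_new.items() if cid in ps and c in kb_old
    )
    if not children:
        return [(p, None) for p in parents]
    return [(p, c) for p in parents for c in children]


if __name__ == "__main__":
    for ctx, cid in out_of_kb_mentions(MENTIONS, KB_OLD):
        print(cid, gold_edges(cid, KB_OLD, KB_NEW))
```

In this toy setup the context string is carried alongside each mention, which mirrors the abstract's point that a concept label alone lacks context. The sketch handles only atomic concepts with direct parents already in the older KB; complex concepts with logical operators are out of its scope.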

Code Implementations: 1 repo