CL, AI | Sep 11, 2024

Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy

Thanh Son Do, Daniel B. Hier, Tayo Obafemi-Ajayi

arXiv:2409.13746v22.72 citations | h-index: 14

Originality: Incremental advance

AI Analysis

This work examines a limitation of LLMs in biomedical ontology mapping and highlights the need to account for ontology ID prevalence in training and evaluation. The advance is incremental, but it matters for improving accuracy on low-prevalence IDs.

This study evaluated how the prevalence of biomedical ontology IDs in the literature affects the accuracy of large language models in mapping terms to IDs. Higher prevalence strongly predicted better accuracy for HPO, GO, and UniProtKB mappings. It did not for HUGO gene symbols, where GPT-4 achieved 95% accuracy regardless of prevalence.

This study evaluates the ability of large language models (LLMs) to map biomedical ontology terms to their corresponding ontology IDs across the Human Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies. Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate for their prevalence in the biomedical literature, we examined the relationship between ontology ID prevalence and mapping accuracy. Results indicate that ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers. Higher prevalence of ontology IDs in the biomedical literature correlated with higher mapping accuracy. Predictive models based on receiver operating characteristic (ROC) curves confirmed this relationship. In contrast, this pattern did not apply to mapping protein names to Human Genome Organisation's (HUGO) gene symbols. GPT-4 achieved a high baseline performance (95%) in mapping protein names to HUGO gene symbols, with mapping accuracy unaffected by prevalence. We propose that the high prevalence of HUGO gene symbols in the literature has caused these symbols to become lexicalized, enabling GPT-4 to map protein names to HUGO gene symbols with high accuracy. These findings highlight the limitations of LLMs in mapping ontology terms to low-prevalence ontology IDs and underscore the importance of incorporating ontology ID prevalence into the training and evaluation of LLMs for biomedical applications.
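The relationship between prevalence and mapping accuracy described above can be illustrated with a small ROC calculation. The following is a minimal sketch, not the authors' pipeline: it assumes each ontology ID has a PMC occurrence count and a binary flag for whether the LLM mapped its term correctly, and it computes the area under the ROC curve when the count alone is used as the predictor. The function name `roc_auc`, the variable names, and all data values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen correct mapping
    (label 1) has a higher score than a randomly chosen incorrect
    mapping (label 0); ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect mapping")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the paper): occurrence counts of eight
# ontology IDs in PMC, and whether the model mapped each term correctly.
pmc_counts = [0, 1, 3, 10, 50, 200, 1000, 5000]
mapped_correctly = [0, 0, 0, 1, 0, 1, 1, 1]

auc = roc_auc(pmc_counts, mapped_correctly)
print(f"AUC of prevalence as a predictor of correct mapping: {auc:.3f}")
```

An AUC near 1.0 would indicate that prevalence separates correct from incorrect mappings well, while an AUC near 0.5 would indicate no predictive value, which is the pattern the abstract reports for HUGO gene symbols. Because the AUC depends only on the ranking of the scores, raw counts and log-transformed counts give the same result.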

