CL LG · Jun 17, 2021

Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study

Rahul Nadkarni, David Wadden, Iz Beltagy, Noah A. Smith, Hannaneh Hajishirzi, Tom Hope

arXiv:2106.09700v23.833 citations · h-index: 85 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses biomedical knowledge base completion, which supports applications such as drug design and repurposing. The advance is incremental: it builds on existing KG completion methods by exploring domain-specific language models.

The study predicts missing links in biomedical knowledge graphs by fine-tuning scientific language models and combining them with KG embedding models through a router that learns to assign each input example to one of the two model types. The combination yields a substantial performance boost, and the LM-based models show an advantage in the inductive setting with novel entities.

Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.
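The abstract describes the router only at a high level: it learns to send each input example to either the LM-based model or the KG embedding model. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the router is a logistic-regression classifier over a single made-up feature (a normalised entity degree), and that training labels mark which model did better on each example. The function names, the feature, and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_router(features, labels, lr=1.0, epochs=2000):
    """Fit a logistic-regression router with batch gradient descent.

    features: list of feature vectors (hypothetical per-example features).
    labels:   1 if the LM-based model did better on that example, else 0.
    """
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def route(x, weights, bias):
    """Return "lm" or "kge" for one example's feature vector."""
    p_lm = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return "lm" if p_lm >= 0.5 else "kge"


def routed_score(x, lm_score, kge_score, weights, bias):
    """Use the score of whichever model the router selects."""
    return lm_score if route(x, weights, bias) == "lm" else kge_score


# Toy data, invented for illustration: one feature per example (a
# normalised entity degree in [0, 1]). Label 1 means the LM-based model
# ranked the correct entity higher than the KG embedding model.
train_x = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
train_y = [1, 1, 1, 0, 0, 0]
weights, bias = train_router(train_x, train_y)

print(route([0.1], weights, bias))  # a sparsely connected entity
print(route([0.9], weights, bias))  # a densely connected entity
```

The toy labels encode one plausible intuition, that text-based models help most on sparsely connected entities; the abstract does not state which inputs the router actually uses or what it learns from them.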
