CL · AI · Apr 27, 2022

UBERT: A Novel Language Model for Synonymy Prediction at Scale in the UMLS Metathesaurus

arXiv:2204.12716v1 · 3 citations · h-index: 48
Originality: Incremental advance
AI Analysis

This work automates synonymy prediction for biomedical vocabulary integration. The advance is incremental: it builds on prior deep learning models such as LexLM.

The paper tackles the error-prone, time-consuming process of clustering synonymous terms in the UMLS Metathesaurus. It introduces UBERT, a BERT-based language model pretrained on UMLS terms via a supervised synonymy prediction task, and reports that UBERT outperforms LexLM and biomedical BERT-based models on the UMLS Vocabulary Alignment task.

The UMLS Metathesaurus integrates more than 200 biomedical source vocabularies. During the Metathesaurus construction process, synonymous terms are clustered into concepts by human editors, assisted by lexical similarity algorithms. This process is error-prone and time-consuming. Recently, a deep learning model (LexLM) has been developed for the UMLS Vocabulary Alignment (UVA) task. This work introduces UBERT, a BERT-based language model pretrained on UMLS terms via a supervised Synonymy Prediction (SP) task that replaces the original Next Sentence Prediction (NSP) task. The effectiveness of UBERT for the UMLS Metathesaurus construction process is evaluated using the UVA task. We show that UBERT outperforms LexLM, as well as biomedical BERT-based models. Key to the performance of UBERT are the synonymy prediction task specifically developed for UBERT, the tight alignment of training data to the UVA task, and the similarity of the models used for pretrained UBERT.
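To make the idea of swapping NSP for a synonymy prediction objective more concrete, here is a minimal sketch of how labeled term pairs could be built and laid out in BERT's two-segment input format. It is an illustration under stated assumptions, not the authors' pipeline: the function names `build_sp_pairs` and `to_bert_input`, the concept identifiers, the toy terms, and the random negative sampling are all hypothetical, and the abstract does not describe how the paper actually samples its training pairs.

```python
import random
from itertools import combinations


def build_sp_pairs(concepts, negatives_per_positive=1, seed=0):
    """Build (term_a, term_b, label) examples for a synonymy prediction task.

    concepts maps a concept id to its list of synonymous terms.
    Label 1 = both terms belong to the same concept, 0 = different concepts.
    Assumption: each term appears under exactly one concept, and negatives
    are drawn uniformly at random (a simplification chosen for this sketch).
    """
    rng = random.Random(seed)
    all_terms = [(t, cid) for cid, terms in concepts.items() for t in terms]
    examples = []
    for cid, terms in concepts.items():
        # Terms from every other concept are candidate non-synonyms.
        others = [t for t, c in all_terms if c != cid]
        for a, b in combinations(terms, 2):
            examples.append((a, b, 1))
            for _ in range(min(negatives_per_positive, len(others))):
                examples.append((a, rng.choice(others), 0))
    return examples


def to_bert_input(term_a, term_b):
    """Lay out a term pair the way BERT lays out a sentence pair."""
    return f"[CLS] {term_a} [SEP] {term_b} [SEP]"


# Toy data with made-up concept ids (not real UMLS identifiers).
toy_concepts = {
    "C_A": ["heart attack", "myocardial infarction", "MI"],
    "C_B": ["headache", "cephalalgia"],
}

pairs = build_sp_pairs(toy_concepts)
for a, b, label in pairs:
    print(label, to_bert_input(a, b))
```

In this layout the classifier head that BERT normally uses for NSP would instead predict the synonymy label for the pair, which is the substitution the abstract describes.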

Code Implementations: 1 repo