CL · Apr 1, 2022

Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations

Tsinghua

arXiv:2204.00391v132.0639 citations · h-index: 22 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in biomedical knowledge graph construction, offering an incremental improvement for domain applications.

The paper tackles biomedical term clustering by addressing a weakness of existing term embeddings: they are insensitive to minor textual differences. It proposes CODER++, which adjusts the sampling strategy in contrastive learning to learn fine-grained representations. The authors report improved clustering performance, and the method has been applied to cluster concepts in the BIOS knowledge graph.

Term clustering is important in biomedical knowledge graph construction, and similarities between term embeddings are helpful for it. State-of-the-art term embeddings leverage pretrained language models to encode terms, and use synonym and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings place terms belonging to the same concept close together. However, our probing experiments show that these embeddings are not sensitive to minor textual differences, which leads to failures in biomedical term clustering. To alleviate this problem, we adjust the sampling strategy used in pretraining term embeddings: we provide dynamic hard positive and negative samples during contrastive learning to learn fine-grained representations, which results in better biomedical term clustering. We name our proposed method CODER++, and it has been applied to cluster biomedical concepts in the newly released biomedical knowledge graph BIOS.
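The idea of hard positive and negative samples can be illustrated with a small sketch. This is not the CODER++ implementation: the function name `mine_hard_samples`, the toy two-dimensional embeddings, and the concept IDs are all assumptions made for illustration. The sketch assumes that a hard positive is the same-concept term least similar to the anchor, and that hard negatives are the different-concept terms most similar to it. In a "dynamic" setup, such mining would presumably be repeated with the current embeddings as training progresses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_samples(embeddings, concept_ids, anchor, k_neg=2):
    """Hypothetical helper: pick hard samples for one anchor term.

    Returns (hard_positive_index, hard_negative_indices), where the hard
    positive is the least similar term of the same concept and the hard
    negatives are the k_neg most similar terms of other concepts.
    """
    positives, negatives = [], []
    for i, cid in enumerate(concept_ids):
        if i == anchor:
            continue
        sim = cosine(embeddings[anchor], embeddings[i])
        if cid == concept_ids[anchor]:
            positives.append((sim, i))
        else:
            negatives.append((sim, i))
    hard_positive = min(positives)[1] if positives else None
    hard_negatives = [i for _, i in sorted(negatives, reverse=True)[:k_neg]]
    return hard_positive, hard_negatives


# Toy data (made up): three synonyms of one concept and two other concepts.
terms = ["type 1 diabetes", "diabetes mellitus type 1", "T1DM",
         "type 2 diabetes", "asthma"]
concept_ids = ["C1", "C1", "C1", "C2", "C3"]
embeddings = [
    [1.0, 0.0],    # anchor
    [0.9, 0.1],    # easy positive: close to the anchor
    [0.6, 0.8],    # hard positive: same concept, far from the anchor
    [0.95, 0.05],  # hard negative: different concept, very close
    [0.0, 1.0],    # easy negative: far from the anchor
]

hard_pos, hard_negs = mine_hard_samples(embeddings, concept_ids, anchor=0, k_neg=1)
print(terms[hard_pos], [terms[i] for i in hard_negs])
```

In this toy setup, "type 2 diabetes" differs from the anchor by one character yet belongs to a different concept, which is the kind of minor textual difference the abstract says existing embeddings fail to separate.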
