CL · Apr 3, 2019

Probing Biomedical Embeddings from Language Models

Qiao Jin, Bhuwan Dhingra, William W. Cohen, Xinghua Lu

arXiv:1904.02181v1 · 31.51132 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work helps biomedical NLP researchers understand embedding quality, but it is incremental: it probes existing models rather than introducing new methods.

The paper investigates the intrinsic information captured by biomedical language model embeddings through probing experiments, finding that BioELMo outperforms BioBERT as a fixed feature extractor due to better encoding of entity-type and relational information.

Contextualized word embeddings derived from pre-trained language models (LMs) show significant improvements on downstream NLP tasks. Pre-training on domain-specific corpora, such as biomedical articles, further improves their performance. In this paper, we conduct probing experiments to determine what additional information is carried intrinsically by the in-domain trained contextualized embeddings. For this we use the pre-trained LMs as fixed feature extractors and restrict the downstream task models to not have additional sequence modeling layers. We compare BERT, ELMo, BioBERT and BioELMo, a biomedical version of ELMo trained on 10M PubMed abstracts. Surprisingly, while fine-tuned BioBERT is better than BioELMo in biomedical NER and NLI tasks, as a fixed feature extractor BioELMo outperforms BioBERT in our probing tasks. We use visualization and nearest neighbor analysis to show that better encoding of entity-type and relational information leads to this superiority.
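The probing setup described in the abstract (a frozen feature extractor feeding a task model with no extra sequence modeling layers) can be illustrated with a minimal sketch. This is not the authors' code: the toy Gaussian vectors stand in for embeddings that would come from a model such as BioELMo or BioBERT, and every function name here (`train_linear_probe`, `toy_frozen_embeddings`, and so on) is hypothetical. The probe is assumed to be a single softmax layer trained by plain gradient descent, and only its weights are updated.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """One linear layer: a dot product per class, no sequence modeling."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, epochs=100, lr=0.1):
    """Fit a softmax classifier on fixed vectors; the vectors are never updated."""
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Cross-entropy gradient: predicted probability minus one-hot target.
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    s = class_scores(weights, biases, x)
    return max(range(len(s)), key=lambda c: s[c])


def probe_accuracy(weights, biases, features, labels):
    """Fraction of examples the probe labels correctly."""
    correct = sum(predict(weights, biases, x) == y
                  for x, y in zip(features, labels))
    return correct / len(labels)


def toy_frozen_embeddings(n_per_class, dim, rng):
    """Hypothetical stand-in for extractor output: one Gaussian cluster per class."""
    features, labels = [], []
    for label, center in enumerate((2.0, -2.0)):
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 0.5) for _ in range(dim)])
            labels.append(label)
    return features, labels


rng = random.Random(0)
train_x, train_y = toy_frozen_embeddings(20, 4, rng)
test_x, test_y = toy_frozen_embeddings(20, 4, rng)
weights, biases = train_linear_probe(train_x, train_y, num_classes=2)
print(probe_accuracy(weights, biases, test_x, test_y))
```

Because the probe has so little capacity of its own, running the same probe on vectors from two different extractors gives a rough comparison of what each set of embeddings already encodes, which is the kind of comparison the abstract reports between BioELMo and BioBERT.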
