Unsupervised Pre-training for Biomedical Question Answering
This work addresses the challenge of improving QA accuracy in the biomedical domain; its contribution is incremental, as it builds on existing pre-training methods.
The authors improve biomedical question answering by introducing a new unsupervised pre-training task that corrupts a context by replacing biomedical entity mentions; pre-training on this task boosted BioBERT's performance and outperformed the previous best model in the BioASQ challenge.
We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task on unlabeled data, designed to reason about biomedical entities in context. Our pre-training method corrupts a given context by replacing a randomly chosen mention of a biomedical entity with a random entity mention, and then queries the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that help QA tasks, and it minimizes the train-test mismatch between the pre-training task and the downstream QA tasks by requiring the model to predict spans. Our experiments show that pre-training BioBERT on the proposed task significantly boosts performance and outperforms the previous best model from the 7th BioASQ challenge (Task 7b, Phase B).
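The corruption step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `corrupt_context`, the assumption that entity mentions arrive as pre-computed character offsets, and the small `entity_vocab` list are all hypothetical choices made for illustration. The sketch builds one pre-training example in the span-prediction form used by extractive QA: the query is the original (correct) mention, and the answer span marks where the replacement mention now sits in the corrupted context.

```python
import random


def corrupt_context(context, entity_spans, entity_vocab, rng):
    """Build one de-noising example from an unlabeled context.

    context      -- raw text
    entity_spans -- (start, end) character offsets of entity mentions
                    (assumed to come from some external entity tagger)
    entity_vocab -- pool of entity mentions to draw replacements from
    rng          -- a random.Random instance, for reproducibility

    Returns (query, corrupted_context, (answer_start, answer_end)).
    """
    # Pick one mention in the context to corrupt.
    start, end = rng.choice(entity_spans)
    original = context[start:end]

    # Draw a replacement that differs from the original mention.
    candidates = [e for e in entity_vocab if e != original]
    if not candidates:
        raise ValueError("entity_vocab has no mention other than the original")
    replacement = rng.choice(candidates)

    # Splice the replacement into the context.
    corrupted = context[:start] + replacement + context[end:]

    # The model is queried with the correct mention and must predict
    # the span that was corrupted.
    return original, corrupted, (start, start + len(replacement))


if __name__ == "__main__":
    text = "Aspirin inhibits cyclooxygenase in platelets."
    spans = []
    for mention in ("Aspirin", "cyclooxygenase"):
        begin = text.index(mention)
        spans.append((begin, begin + len(mention)))
    vocab = ["Aspirin", "cyclooxygenase", "insulin", "hemoglobin"]
    query, corrupted, answer = corrupt_context(text, spans, vocab, random.Random(0))
    print(query, "|", corrupted, "|", answer)
```

Because the target is a span in the corrupted context, examples produced this way have the same input/output shape as extractive QA data, which is the train-test alignment the abstract refers to.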