SciBERT: A Pretrained Language Model for Scientific Text
SciBERT addresses the problem of expensive data annotation for researchers in scientific domains, though the contribution is incremental: it adapts an existing method (BERT) to new data.
The authors tackled the challenge of limited annotated data for scientific NLP by releasing SciBERT, a pretrained language model based on BERT, which achieved statistically significant improvements over BERT and new state-of-the-art results on tasks such as sequence tagging and sentence classification.
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.