CL · Mar 26, 2019

SciBERT: A Pretrained Language Model for Scientific Text

arXiv:1903.10676v3 · 3720 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of expensive data annotation for researchers in scientific domains, though the contribution is incremental: it adapts an existing method (BERT) to new data.

The authors tackle the challenge of limited annotated data for scientific NLP by releasing SciBERT, a pretrained language model based on BERT. It achieves statistically significant improvements over BERT and new state-of-the-art results on several tasks from a suite covering sequence tagging, sentence classification, and dependency parsing.

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

