Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model
This work addresses the lack of annotated data for entity extraction in chemistry. The contribution is incremental: it adapts existing NLP methods to a new domain-specific corpus.
The authors extract chemical entities and relations from scientific publications by building a new annotated corpus in the chemical bond domain and proposing a BERT-CRF model that extracts entities and relations jointly. They report state-of-the-art, competitive NER performance on their Chemical Special Corpus.
Computational chemistry has developed rapidly in recent years, driven by the growth of and breakthroughs in AI. Thanks to progress in natural language processing, researchers can extract more fine-grained knowledge from publications to stimulate development in computational chemistry. However, existing work and corpora for chemical entity extraction have been restricted to biomedicine and the life sciences rather than chemistry itself. We therefore build a new corpus in the chemical bond field, annotated with 7 entity types: compound, solvent, method, bond, reaction, pKa, and pKa value. This paper presents a novel BERT-CRF model that builds scientific chemical data chains by extracting these 7 entity types and their relations from publications. We also propose a joint model that extracts entities and relations simultaneously. Experimental results on our Chemical Special Corpus demonstrate that we achieve state-of-the-art, competitive NER performance.
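To illustrate why a CRF layer is commonly placed on top of BERT token scores, the following is a minimal sketch, not the authors' implementation. It assumes a BIO tagging scheme over the seven entity types (the abstract does not state the scheme), uses hypothetical label names such as `pKa_value`, and replaces learned CRF transition scores with hard BIO constraints. The `emissions` argument stands in for per-token label scores that a BERT encoder would produce.

```python
from typing import Dict, List

# The seven entity types named in the abstract; the label spellings are assumptions.
ENTITY_TYPES = ["compound", "solvent", "method", "bond", "reaction", "pKa", "pKa_value"]


def build_bio_labels(entity_types: List[str]) -> List[str]:
    """Return the BIO label set: 'O' plus B-/I- tags for each entity type."""
    labels = ["O"]
    for t in entity_types:
        labels += [f"B-{t}", f"I-{t}"]
    return labels


LABELS = build_bio_labels(ENTITY_TYPES)


def allowed(prev: str, curr: str) -> bool:
    """Hard BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True


def viterbi(emissions: List[Dict[str, float]], labels: List[str] = LABELS) -> List[str]:
    """Find the highest-scoring label path that satisfies the BIO constraints.

    Each element of `emissions` maps a label to its score for one token;
    labels missing from the dict score 0.0. A trained CRF would add learned
    transition scores; here transitions are either allowed (0) or forbidden.
    """
    if not emissions:
        return []
    neg_inf = float("-inf")
    # A sequence cannot start with an I- tag.
    score = {
        l: (neg_inf if l.startswith("I-") else emissions[0].get(l, 0.0))
        for l in labels
    }
    backpointers = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for curr in labels:
            best_prev, best = None, neg_inf
            for prev in labels:
                if allowed(prev, curr) and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[curr] = best + em.get(curr, 0.0) if best_prev is not None else neg_inf
            ptr[curr] = best_prev
        score = new_score
        backpointers.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda l: score[l])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


# Two tokens, e.g. "acetic acid": per-token argmax would give the invalid
# sequence ["O", "I-compound"], while constrained decoding repairs it.
example = [{"O": 1.0, "B-compound": 0.9}, {"I-compound": 2.0, "O": 0.5}]
decoded = viterbi(example)
```

In this toy example, picking the top label for each token independently would start an entity with an `I-` tag. Decoding over whole sequences avoids that, which is the usual motivation for adding a CRF to a token classifier.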