CL · MTRL-SCI · SOFT · Sep 27, 2022

A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing

arXiv:2209.13136v1 · 121 citations · h-index: 64
Originality: Incremental advance
AI Analysis

This work addresses a challenge facing materials scientists: efficiently accessing and analyzing property data from the published literature. The advance is incremental, since it builds on existing NLP methods.

The researchers developed an NLP-based pipeline to extract material property data from the growing polymer literature. The pipeline processed ~130,000 abstracts in 60 hours, yielding ~300,000 records, which are available via a web platform.
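To put the reported figures in perspective, the short calculation below derives the implied throughput. It is only back-of-the-envelope arithmetic on the approximate numbers quoted above; the variable names are illustrative, and the source says nothing about hardware or parallelism.

```python
# Rough throughput implied by the reported figures (all inputs are approximate).
abstracts = 130_000  # ~130,000 abstracts processed
records = 300_000    # ~300,000 material property records extracted
hours = 60           # reported processing time

abstracts_per_hour = abstracts / hours
seconds_per_abstract = hours * 3600 / abstracts
records_per_abstract = records / abstracts

print(f"{abstracts_per_hour:.0f} abstracts per hour")
print(f"{seconds_per_abstract:.2f} seconds per abstract")
print(f"{records_per_abstract:.1f} records per abstract")
```

On these numbers the pipeline handles roughly 2,200 abstracts per hour, under two seconds each, and yields a little over two property records per abstract on average.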

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from the published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer papers. As a component of our pipeline, we trained MaterialsBERT, a language model, on 2.4 million materials science abstracts; used as the text encoder, it outperforms other baseline models on three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. We analyzed the extracted data for a diverse range of applications, such as fuel cells, supercapacitors, and polymer solar cells, and recovered non-trivial insights. The extracted data is available through a web platform at https://polymerscholar.org, which can be used to conveniently locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.
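As a rough illustration of the step between named entity recognition and a property record, the sketch below groups BIO-tagged tokens into entity spans and collects them into a simple record. It is a minimal sketch under stated assumptions, not the authors' implementation: the label names (`POLYMER`, `PROP_NAME`, `PROP_VALUE`), the helper functions, and the example sentence are all hypothetical, and the tags are written by hand where the real pipeline would use a tagger built on MaterialsBERT.

```python
from typing import Dict, List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            words.append(token)
            continue
        # Any other tag closes the span that was open, if there was one.
        if label is not None:
            spans.append((label, " ".join(words)))
        if tag == "O":
            label, words = None, []
        else:
            # "B-X" starts a span; a stray "I-X" is treated as a start too.
            label, words = tag[2:], [token]
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


def spans_to_record(spans: List[Tuple[str, str]]) -> Dict[str, str]:
    """Naive record: keep the first span seen for each label."""
    record: Dict[str, str] = {}
    for label, text in spans:
        record.setdefault(label, text)
    return record


# Hand-tagged toy sentence (hypothetical labels, illustrative value only).
tokens = ["Polystyrene", "has", "a", "glass", "transition",
          "temperature", "of", "100", "°C", "."]
tags = ["B-POLYMER", "O", "O", "B-PROP_NAME", "I-PROP_NAME",
        "I-PROP_NAME", "O", "B-PROP_VALUE", "I-PROP_VALUE", "O"]

spans = bio_to_spans(tokens, tags)
record = spans_to_record(spans)
print(record)
```

Keeping only the first span per label works for a single-material, single-property sentence like this one. Abstracts that mention several polymers or properties would need an explicit rule for linking each value to the right material, which this sketch does not attempt.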
