CL · MTRL-SCI · Dec 10, 2022

Structured information extraction from complex scientific text with fine-tuned large language models

Princeton
arXiv:2212.05238v1 · 117 citations · h-index: 158
Originality: Incremental advance
AI Analysis

This work gives researchers inexperienced with NLP an accessible way to build large databases from unstructured scientific text, though the advance is incremental: it applies existing fine-tuning techniques to new domains.

The authors tackle the problem of extracting structured information from complex scientific text by fine-tuning GPT-3 on approximately 500 prompt-completion pairs, achieving accurate extraction for materials chemistry tasks such as linking dopants with host materials and cataloging metal-organic frameworks.

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
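
To make the prompt-completion setup concrete, here is a minimal Python sketch of how one training pair might be assembled and how a JSON-style completion could be parsed back into records. Everything specific in it is an assumption for illustration, not taken from the paper: the separator and stop markers, the "host"/"dopants" field names, the example sentence, and the helper names make_training_pair, parse_completion, and to_jsonl.

```python
import json

# Hypothetical markers: the abstract does not say which separator or
# stop sequence the authors used, so these are placeholders.
PROMPT_SUFFIX = "\n\n###\n\n"
COMPLETION_STOP = "\n\nEND"


def make_training_pair(passage, records):
    """Build one prompt-completion pair whose completion is a JSON list."""
    return {
        "prompt": passage.strip() + PROMPT_SUFFIX,
        "completion": " " + json.dumps(records) + COMPLETION_STOP,
    }


def parse_completion(text):
    """Recover the list of records from model output; [] if it is malformed."""
    body = text.split(COMPLETION_STOP)[0].strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


def to_jsonl(pairs):
    """Serialize pairs one JSON object per line, a common fine-tuning file layout."""
    return "\n".join(json.dumps(pair) for pair in pairs)


# Illustrative (made-up) sentence and schema for the dopant-host task.
example = make_training_pair(
    "ZnO doped with Al shows improved conductivity.",
    [{"host": "ZnO", "dopants": ["Al"]}],
)
```

Returning an empty list on malformed output is one possible design choice; a real pipeline might instead log and review such completions, since a fine-tuned model is not guaranteed to emit valid JSON.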

