Large language model-enabled automated data extraction for concrete materials informatics

Zhanzhao Li, Kengran Yang, Qiyao He, Kai Gong

arXiv:2604.22938

Predicted impact: top 81% in MTRL-SCI (last 90 days) · Originality: Incremental advance

AI Analysis

This work addresses the bottleneck of scarce experimental datasets in materials informatics by providing a scalable, generalizable data extraction method for the materials science community.

The authors developed an LLM-powered pipeline that automatically extracts structured materials data from unstructured literature, achieving an F1 score of up to 0.97. Within one hour, it extracted nearly 9,000 records screened from more than 27,000 publications, yielding the largest open laboratory database for blended cement concrete.

The promise of data-driven materials discovery remains constrained by the scarcity of large, high-quality, and accessible experimental datasets. Here, we introduce a generalizable large language model (LLM)-powered pipeline for automated extraction and structuring of materials data from unstructured scientific literature, using concrete materials as a representative and particularly challenging example. The pipeline exhibits robust performance across a broad range of LLMs and achieves an $F_1$ score of up to 0.97 for diverse composition--process--property attributes. Within one hour, it extracts nearly 9,000 high-quality records with over 100 attributes screened from more than 27,000 publications, enabling the construction of the largest open laboratory database for blended cement concrete. Machine learning analyses underscore the importance of large, diverse, and information-rich datasets for enhancing both in-distribution accuracy and out-of-distribution generalization to unseen materials. The proposed pipeline is readily adaptable to other materials domains and accelerates the development of scalable data infrastructures for materials informatics.
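The abstract reports extraction quality as an $F_1$ score but does not describe the evaluation protocol. As a rough illustration only, the sketch below computes precision, recall, and $F_1$ over exact-match attribute-value pairs for a single record. The function name `attribute_f1`, the attribute names, and all values are hypothetical and are not taken from the paper, which may score matches differently (for example, with numeric tolerances or per-attribute averaging).

```python
from typing import Dict, Tuple

# A record maps attribute names to extracted values (kept as strings here).
Record = Dict[str, str]


def attribute_f1(predicted: Record, reference: Record) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exact-match attribute-value pairs.

    Assumption: a predicted pair counts as correct only if both the attribute
    name and the value match the reference exactly.
    """
    pred_pairs = set(predicted.items())
    ref_pairs = set(reference.items())
    true_pos = len(pred_pairs & ref_pairs)

    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(ref_pairs) if ref_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy record: values are invented for illustration only.
reference = {
    "cement_kg_m3": "350",
    "fly_ash_kg_m3": "88",
    "w_b_ratio": "0.40",
    "strength_28d_MPa": "45.2",
}
# The hypothetical extraction gets one of the four values wrong.
predicted = {
    "cement_kg_m3": "350",
    "fly_ash_kg_m3": "88",
    "w_b_ratio": "0.45",
    "strength_28d_MPa": "45.2",
}

precision, recall, f1 = attribute_f1(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With three of four pairs matching on both sides, precision and recall are each 0.75, so $F_1 = 2PR/(P+R)$ is also 0.75.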
