CL MTRL-SCI · May 27, 2025

Iterative Corpus Refinement for Materials Property Prediction Based on Scientific Texts

arXiv:2505.21646v2 · 1 citation · h-index: 3 · ECML/PKDD

Originality: Incremental advance

AI Analysis

This work addresses data scarcity in materials science by providing a scalable tool for screening large compositional spaces. The advance is incremental because it builds on existing text-mining and embedding methods.

The authors tackle the combinatorial explosion in materials discovery with an iterative corpus refinement framework that uses scientific texts to predict high-performing materials for the oxygen reduction, hydrogen evolution, and oxygen evolution reactions. The framework identified the top compositions, and experimental measurements validated these predictions.

The discovery and optimization of materials for specific applications is hampered by the practically infinite number of possible elemental combinations and associated properties, also known as the 'combinatorial explosion'. By nature of the problem, data are scarce and all possible data sources should be used. In addition to simulations and experimental results, the latent knowledge in scientific texts is not yet used to its full potential. We present an iterative framework that refines a given scientific corpus by strategic selection of the most diverse documents, training Word2Vec models, and monitoring the convergence of composition-property correlations in embedding space. Our approach is applied to predict high-performing materials for oxygen reduction (ORR), hydrogen evolution (HER), and oxygen evolution (OER) reactions for a large number of possible candidate compositions. Our method successfully predicts the highest performing compositions among a large pool of candidates, validated by experimental measurements of the electrocatalytic performance in the lab. This work demonstrates and validates the potential of iterative corpus refinement to accelerate materials discovery and optimization, offering a scalable and efficient tool for screening large compositional spaces where reliable data are scarce or non-existent.
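
To make the idea of "monitoring the convergence of composition-property correlations in embedding space" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not say how a composition is embedded or how convergence is measured, so this sketch assumes a fraction-weighted average of element vectors, a cosine similarity to a property vector, and a stable top-k set as the stopping rule. The elements `A`, `B`, `C` and all vectors are hypothetical toy values standing in for Word2Vec embeddings from two successive refinement rounds. Document selection and model training are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def composition_vector(fractions, element_vectors):
    """Fraction-weighted average of element vectors (assumed pooling rule)."""
    dim = len(next(iter(element_vectors.values())))
    vec = [0.0] * dim
    for element, frac in fractions.items():
        for i, x in enumerate(element_vectors[element]):
            vec[i] += frac * x
    return vec


def rank_candidates(candidates, element_vectors, property_vector):
    """Return candidate names sorted by similarity to the property, best first."""
    scores = {
        name: cosine(composition_vector(fr, element_vectors), property_vector)
        for name, fr in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def rankings_converged(previous, current, top_k=2):
    """Assumed stopping rule: the top-k set no longer changes between rounds."""
    return set(previous[:top_k]) == set(current[:top_k])


# Hypothetical candidates over placeholder elements A, B, C.
candidates = {
    "A50B50": {"A": 0.5, "B": 0.5},
    "A50C50": {"A": 0.5, "C": 0.5},
    "B50C50": {"B": 0.5, "C": 0.5},
}

# Toy stand-in for the embedding of a target property term.
property_vector = [1.0, 0.0]

# Toy element embeddings from two successive corpus-refinement rounds.
vectors_round1 = {"A": [0.9, 0.1], "B": [0.4, 0.6], "C": [0.1, 0.9]}
vectors_round2 = {"A": [0.85, 0.15], "B": [0.45, 0.55], "C": [0.15, 0.85]}

ranking_round1 = rank_candidates(candidates, vectors_round1, property_vector)
ranking_round2 = rank_candidates(candidates, vectors_round2, property_vector)

print(ranking_round1)
print(rankings_converged(ranking_round1, ranking_round2))
```

In a full pipeline, each round would retrain the embedding model on the enlarged corpus before re-ranking. The loop would stop once the ranking stabilises under whichever convergence criterion is chosen.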
