CLSep 22, 2024
Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property PredictionsHongchen Wang, Kangming Li, Scott Ramsay et al.
Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of 'noise', ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study showcases unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance recovery from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.
43.2MTRL-SCIMay 6
Building informative materials datasets beyond targeted objectivesRafael Espinosa Castañeda, Ashley Dale, Hongchen Wang et al.
Materials science data collection can be expensive, making the reuse and long-term utility of datasets critical important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns potentially generate datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10% . For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.
MTRL-SCIOct 31, 2024
LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property PredictionAndre Niyongabo Rubungo, Kangming Li, Jason Hattrick-Simpers et al. · princeton
Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and 45 distinct properties. LLM4Mat-Bench features different input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models with different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of LLM-chat-like models, including Llama, Gemma, and Mistral. The results highlight the challenges of general-purpose LLMs in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs in materials property prediction.
MTRL-SCINov 24, 2025
High-throughput validation of phase formability and simulation accuracy of Cantor alloysChangjun Cheng, Daniel Persaud, Kangming Li et al.
High-throughput methods enable accelerated discovery of novel materials in complex systems such as high-entropy alloys, which exhibit intricate phase stability across vast compositional spaces. Computational approaches, including Density Functional Theory (DFT) and calculation of phase diagrams (CALPHAD), facilitate screening of phase formability as a function of composition and temperature. However, the integration of computational predictions with experimental validation remains challenging in high-throughput studies. In this work, we introduce a quantitative confidence metric to assess the agreement between predictions and experimental observations, providing a quantitative measure of the confidence of machine learning models trained on either DFT or CALPHAD input in accounting for experimental evidence. The experimental dataset was generated via high-throughput in-situ synchrotron X-ray diffraction on compositionally varied FeNiMnCr alloy libraries, heated from room temperature to ~1000 °C. Agreement between the observed and predicted phases was evaluated using either temperature-independent phase classification or a model that incorporates a temperature-dependent probability of phase formation. This integrated approach demonstrates where strong overall agreement between computation and experiment exists, while also identifying key discrepancies, particularly in FCC/BCC predictions at Mn-rich regions to inform future model refinement.
MTRL-SCINov 21, 2025
When Active Learning Fails, Uncalibrated Out of Distribution Uncertainty Quantification Might Be the ProblemAshley S. Dale, Kangming Li, Brian DeCost et al.
Efficiently and meaningfully estimating prediction uncertainty is important for exploration in active learning campaigns in materials discovery, where samples with high uncertainty are interpreted as containing information missing from the model. In this work, the effect of different uncertainty estimation and calibration methods are evaluated for active learning when using ensembles of ALIGNN, eXtreme Gradient Boost, Random Forest, and Neural Network model architectures. We compare uncertainty estimates from ALIGNN deep ensembles to loss landscape uncertainty estimates obtained for solubility, bandgap, and formation energy prediction tasks. We then evaluate how the quality of the uncertainty estimate impacts an active learning campaign that seeks model generalization to out-of-distribution data. Uncertainty calibration methods were found to variably generalize from in-domain data to out-of-domain data. Furthermore, calibrated uncertainties were generally unsuccessful in reducing the amount of data required by a model to improve during an active learning campaign on out-of-distribution data when compared to random sampling and uncalibrated uncertainties. The impact of poor-quality uncertainty persists for random forest and eXtreme Gradient Boosting models trained on the same data for the same tasks, indicating that this is at least partially intrinsic to the data and not due to model capacity alone. Analysis of the target, in-distribution uncertainty, out-of-distribution uncertainty, and training residual distributions suggest that future work focus on understanding empirical uncertainties in the feature input space for cases where ensemble prediction variances do not accurately capture the missing information required for the model to generalize.
LGJul 15, 2025
Exploring the Frontiers of kNN Noisy Feature Detection and Recovery for Self-Driving LabsQiuyu Shi, Kangming Li, Yao Fehlis et al.
Self-driving laboratories (SDLs) have shown promise to accelerate materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow to systematically detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery but can be compensated for by larger clean training data sets. Detection and correction results vary between features with continuous and dispersed feature distributions showing greater recoverability compared to features with discrete or narrow distributions. This systematic study not only demonstrates a model agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation in materials data sets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.