Language Models Enable Data-Augmented Synthesis Planning for Inorganic Materials
This work addresses data scarcity in inorganic materials synthesis, offering researchers a scalable, data-efficient hybrid approach. Its contribution is incremental: it combines language models with specialized model training.
The paper tackles inorganic synthesis planning by using off-the-shelf language models to predict synthesis conditions, achieving up to 53.8% Top-1 precursor-prediction accuracy and mean absolute errors below 126°C for calcination and sintering temperatures. It then uses these models to generate synthetic data that improves a specialized transformer model, SyntMTE, reducing mean absolute error to 73°C for sintering temperature and 98°C for calcination temperature, and outperforming baselines trained only on experimental data by up to 8.7%.
Inorganic synthesis planning currently relies primarily on heuristic approaches or machine-learning models trained on limited datasets, which constrains its generality. We demonstrate that language models, without task-specific fine-tuning, can recall synthesis conditions. Off-the-shelf models, such as GPT-4.1, Gemini 2.0 Flash and Llama 4 Maverick, achieve a Top-1 precursor-prediction accuracy of up to 53.8 % and a Top-5 accuracy of 66.1 % on a held-out set of 1,000 reactions. They also predict calcination and sintering temperatures with mean absolute errors below 126 °C, matching specialized regression methods. Ensembling these language models further enhances predictive accuracy and reduces inference cost per prediction by up to 70 %. We subsequently employ language models to generate 28,548 synthetic reaction recipes, which we combine with literature-mined examples to pretrain a transformer-based model, SyntMTE. After fine-tuning on the combined dataset, SyntMTE reduces the mean absolute error in sintering temperature prediction to 73 °C and in calcination temperature prediction to 98 °C. This strategy yields an improvement of up to 8.7 % over baselines trained exclusively on experimental data. Finally, in a case study on Li7La3Zr2O12 solid-state electrolytes, we demonstrate that SyntMTE reproduces the experimentally observed dopant-dependent sintering trends. Our hybrid workflow enables scalable, data-efficient inorganic synthesis planning.
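The two headline metrics, Top-k precursor-prediction accuracy and mean absolute error (MAE) on temperatures, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that a prediction counts as a hit only when one of the k highest-ranked candidate precursor sets exactly matches the reference set, which the abstract does not specify. The function names and the toy reactions and temperatures below are hypothetical and are not data from the paper.

```python
from typing import FrozenSet, List, Sequence

# A precursor set is modeled as an unordered set of formula strings.
PrecursorSet = FrozenSet[str]


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[PrecursorSet]],
    references: Sequence[PrecursorSet],
    k: int,
) -> float:
    """Fraction of reactions whose reference precursor set appears among
    the first k ranked candidates (exact set match; an assumption)."""
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("at least one reaction is required")
    hits = sum(
        1
        for candidates, reference in zip(ranked_predictions, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


def mean_absolute_error(
    predicted: Sequence[float], observed: Sequence[float]
) -> float:
    """Average absolute difference between predicted and observed values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed differ in length")
    if not observed:
        raise ValueError("at least one value is required")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical toy data: two reactions, two ranked candidates each.
references: List[PrecursorSet] = [
    frozenset({"Li2CO3", "La2O3", "ZrO2"}),
    frozenset({"BaCO3", "TiO2"}),
]
ranked: List[List[PrecursorSet]] = [
    [frozenset({"LiOH", "La2O3", "ZrO2"}), frozenset({"Li2CO3", "La2O3", "ZrO2"})],
    [frozenset({"BaCO3", "TiO2"}), frozenset({"BaO", "TiO2"})],
]

top1 = top_k_accuracy(ranked, references, k=1)  # 0.5: only the second reaction hits
top2 = top_k_accuracy(ranked, references, k=2)  # 1.0: both reactions hit
# Hypothetical sintering temperatures in degrees Celsius.
mae = mean_absolute_error([1200.0, 900.0], [1230.0, 950.0])  # 40.0
```

Representing precursor sets as frozensets makes the comparison independent of the order in which precursors are listed. A different matching rule, such as partial credit for overlapping precursors, would give different accuracy values.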