LGFeb 17Code
MolCrystalFlow: Molecular Crystal Structure Prediction via Flow MatchingCheng Zeng, Harry W. Sullivan, Thomas Egg et al.
Molecular crystal structure prediction represents a grand challenge in computational chemistry due to large sizes of constituent molecules and complex intra- and intermolecular interactions. While generative modeling has revolutionized structure discovery for molecules, inorganic solids, and metal-organic frameworks, extending such approaches to fully periodic molecular crystals is still elusive. Here, we present MolCrystalFlow, a flow-based generative model for molecular crystal structure prediction. The framework disentangles intramolecular complexity from intermolecular packing by embedding molecules as rigid bodies and jointly learning the lattice matrix, molecular orientations, and centroid positions. Centroids and orientations are represented on their native Riemannian manifolds, allowing geodesic flow construction and graph neural network operations that respects geometric symmetries. We benchmark our model against state-of-the-art generative models for large-size periodic crystals and rule-based structure generation methods on two open-source molecular crystal datasets. We demonstrate an integration of MolCrystalFlow model with universal machine learning potential to accelerate molecular crystal structure prediction, paving the way for data-driven generative discovery of molecular crystals.
LGFeb 4, 2025Code
Open Materials Generation with Stochastic InterpolantsPhilipp Hoellmer, Thomas Egg, Maya M. Martirossyan et al.
The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMatG), a unifying framework for the generative design and discovery of inorganic crystalline materials. OMatG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMatG's performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions, and 'de novo' generation (DNG) aimed at discovering stable, novel, and unique structures. In our ground-up implementation of OMatG, we refine and extend both CSP and DNG metrics compared to previous works. OMatG establishes a new state of the art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science. The OMatG code is available at https://github.com/FERMat-ML/OMatG.
MTRL-SCISep 11, 2024
When More Data Hurts: Optimizing Data Coverage While Mitigating Diversity Induced Underfitting in an Ultra-Fast Machine-Learned PotentialJason B. Gibson, Tesia D. Janicki, Ajinkya C. Hire et al.
Machine-learned interatomic potentials (MLIPs) are becoming an essential tool in materials modeling. However, optimizing the generation of training data used to parameterize the MLIPs remains a significant challenge. This is because MLIPs can fail when encountering local enviroments too different from those present in the training data. The difficulty of determining \textit{a priori} the environments that will be encountered during molecular dynamics (MD) simulation necessitates diverse, high-quality training data. This study investigates how training data diversity affects the performance of MLIPs using the Ultra-Fast Force Field (UF$^3$) to model amorphous silicon nitride. We employ expert and autonomously generated data to create the training data and fit four force-field variants to subsets of the data. Our findings reveal a critical balance in training data diversity: insufficient diversity hinders generalization, while excessive diversity can exceed the MLIP's learning capacity, reducing simulation accuracy. Specifically, we found that the UF$^3$ variant trained on a subset of the training data, in which nitrogen-rich structures were removed, offered vastly better prediction and simulation accuracy than any other variant. By comparing these UF$^3$ variants, we highlight the nuanced requirements for creating accurate MLIPs, emphasizing the importance of application-specific training data to achieve optimal performance in modeling complex material behaviors.
SUPR-CONJan 29, 2024
Accelerating superconductor discovery through tempered deep learning of the electron-phonon spectral functionJason B. Gibson, Ajinkya C. Hire, Philip M. Dee et al.
Integrating deep learning with the search for new electron-phonon superconductors represents a burgeoning field of research, where the primary challenge lies in the computational intensity of calculating the electron-phonon spectral function, $α^2F(ω)$, the essential ingredient of Midgal-Eliashberg theory of superconductivity. To overcome this challenge, we adopt a two-step approach. First, we compute $α^2F(ω)$ for 818 dynamically stable materials. We then train a deep-learning model to predict $α^2F(ω)$, using an unconventional training strategy to temper the model's overfitting, enhancing predictions. Specifically, we train a Bootstrapped Ensemble of Tempered Equivariant graph neural NETworks (BETE-NET), obtaining an MAE of 0.21, 45 K, and 43 K for the Eliashberg moments derived from $α^2F(ω)$: $λ$, $ω_{\log}$, and $ω_{2}$, respectively, yielding an MAE of 2.5 K for the critical temperature, $T_c$. Further, we incorporate domain knowledge of the site-projected phonon density of states to impose inductive bias into the model's node attributes and enhance predictions. This methodological innovation decreases the MAE to 0.18, 29 K, and 28 K, respectively, yielding an MAE of 2.1 K for $T_c$. We illustrate the practical application of our model in high-throughput screening for high-$T_c$ materials. The model demonstrates an average precision nearly five times higher than random screening, highlighting the potential of ML in accelerating superconductor discovery. BETE-NET accelerates the search for high-$T_c$ superconductors while setting a precedent for applying ML in materials discovery, particularly when data is limited.
LGDec 13, 2025
MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow MatchingJirui Jin, Cheng Zeng, Pawan Prakash et al.
Key objectives in conditional molecular generation include ensuring chemical validity, aligning generated molecules with target properties, promoting structural diversity, and enabling efficient sampling for discovery. Recent advances in computer vision introduced a range of new guidance strategies for generative models, many of which can be adapted to support these goals. In this work, we integrate state-of-the-art guidance methods -- including classifier-free guidance, autoguidance, and model guidance -- in a leading molecule generation framework built on an SE(3)-equivariant flow matching process. We propose a hybrid guidance strategy that separately guides continuous and discrete molecular modalities -- operating on velocity fields and predicted logits, respectively -- while jointly optimizing their guidance scales via Bayesian optimization. Our implementation, benchmarked on the QM9 and QMe14S datasets, achieves new state-of-the-art performance in property alignment for de novo molecular generation. The generated molecules also exhibit high structural validity. Furthermore, we systematically compare the strengths and limitations of various guidance methods, offering insights into their broader applicability.
SUPR-CONSep 29, 2025
Guided Diffusion for the Discovery of New SuperconductorsPawan Prakash, Jason B. Gibson, Zhongwei Li et al.
The inverse design of materials with specific desired properties, such as high-temperature superconductivity, represents a formidable challenge in materials science due to the vastness of chemical and structural space. We present a guided diffusion framework to accelerate the discovery of novel superconductors. A DiffCSP foundation model is pretrained on the Alexandria Database and fine-tuned on 7,183 superconductors with first principles derived labels. Employing classifier-free guidance, we sample 200,000 structures, which lead to 34,027 unique candidates. A multistage screening process that combines machine learning and density functional theory (DFT) calculations to assess stability and electronic properties, identifies 773 candidates with DFT-calculated $T_\mathrm{c}>5$ K. Notably, our generative model demonstrates effective property-driven design. Our computational findings were validated against experimental synthesis and characterization performed as part of this work, which highlighted challenges in sparsely charted chemistries. This end-to-end workflow accelerates superconductor discovery while underscoring the challenge of predicting and synthesizing experimentally realizable materials.
LGSep 15, 2025
All that structure matches does not glitterMaya M. Martirossyan, Thomas Egg, Philipp Hoellmer et al.
Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends critically on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task$\unicode{x2014}$generating the most likely structures given the chemical composition of a material. We focus on three key issues: First, materials datasets should contain unique crystal structures; for example, we show that the widely-utilized carbon-24 dataset only contains $\approx$40% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous, which we find to be the case for the perov-5 dataset. Third, benchmarks can mislead if used uncritically, e.g., reporting a match rate metric without considering the structural variety exhibited by identical building blocks. To address these oft-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms $N$, and two containing only identical structures but with different unit cells. We also propose a new split for the perov-5 dataset which ensures polymorphs are grouped within each split subset, setting a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that can correct existing issues with the match rate metric.