MTRL-SCI · LG · CHEM-PH · Dec 18, 2025

How accurate are foundational machine learning interatomic potentials for heterogeneous catalysis?

arXiv:2512.16702v1 · 5 citations · h-index: 3
Originality: Incremental advance
AI Analysis

Benchmarks for machine learning interatomic potentials (MLIPs) rarely cover realistic heterogeneous catalysis systems. This work addresses that gap, documenting where current models fall short and showing that model choice must be checked against the intended application.

The study systematically evaluated the zero-shot performance of 80 foundational machine learning interatomic potentials (MLIPs) on heterogeneous catalysis tasks. The models can reach high accuracy for properties such as vacancy formation energies of perovskite oxides and zero-point energies of supported nanoclusters. However, many fail catastrophically on magnetic materials, and relaxing structures with the MLIP generally increases the energy error relative to single-point evaluation.

Foundational machine learning interatomic potentials (MLIPs) are being developed at a rapid pace, promising an ever closer approximation to ab initio accuracy. This unlocks the possibility of simulating much larger length and time scales. However, benchmarks for these MLIPs are usually limited to ordered, crystalline, and bulk materials. Hence, reported performance does not necessarily reflect MLIP performance in real applications such as heterogeneous catalysis. Here, we systematically analyze the zero-shot performance of 80 different MLIPs, evaluating tasks typical for heterogeneous catalysis across a range of data sets, including adsorption and reaction on surfaces of alloyed metals, oxides, and metal-oxide interfacial systems. We demonstrate that current-generation foundational MLIPs can already perform at high accuracy for applications such as predicting vacancy formation energies of perovskite oxides or zero-point energies of supported nanoclusters. However, limitations also exist. We find that many MLIPs fail catastrophically when applied to magnetic materials, and that structure relaxation with the MLIP generally increases the energy prediction error compared to single-point evaluation of a previously optimized structure. Comparing low-cost task-specific models to foundational MLIPs, we highlight some core differences between the two modeling approaches and show that, if only accuracy is considered, the task-specific models can compete with the best-performing current-generation MLIPs. Furthermore, we show that no single MLIP universally performs best, requiring users to investigate MLIP suitability for their desired application.
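The abstract's comparison between single-point and relaxed evaluation can be made concrete with a minimal sketch. It is not the authors' code. The function names, the common oxygen-vacancy convention E_vf = E(defective) + 1/2 E(O2) - E(pristine), and all numbers are assumptions chosen for illustration; the paper's exact reference states and error metrics may differ.

```python
from statistics import mean


def vacancy_formation_energy(e_defective, e_pristine, e_o2):
    """Oxygen vacancy formation energy in eV.

    Assumed convention (not taken from the paper):
    E_vf = E(defective) + 1/2 * E(O2) - E(pristine)
    """
    return e_defective + 0.5 * e_o2 - e_pristine


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equal-length sequences of energies."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one value is required")
    return mean(abs(p - r) for p, r in zip(predicted, reference))


if __name__ == "__main__":
    # Placeholder values only: these are NOT results from the paper.
    dft_reference = [3.10, 4.25, 2.80]      # reference E_vf per structure (eV)
    mlip_single_point = [3.00, 4.40, 2.75]  # MLIP energy on DFT geometries
    mlip_relaxed = [2.85, 4.60, 2.60]       # MLIP energy after MLIP relaxation

    print("single-point MAE:", mean_absolute_error(mlip_single_point, dft_reference))
    print("relaxed MAE:     ", mean_absolute_error(mlip_relaxed, dft_reference))
```

Computing both errors against the same reference separates the model's energy error on a fixed geometry from the additional error introduced when the geometry itself is optimized on the MLIP potential energy surface.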
