O. Anatole von Lilienfeld

STR-EL

h-index: 58

7 papers

126 citations

Novelty: 44%

AI Score: 42

Ranked #62,975 of 194,257 authors (top 32%); #5 in STR-EL (top 19%)

7 Papers

1.2 | SOC-PH | Nov 26, 2025

AI4X Roadmap: Artificial Intelligence for the advancement of scientific pursuit and its future directions

Stephen G. Dale, Nikita Kazeev, Alastair J. A. Price et al.

Artificial intelligence and machine learning are reshaping how we approach scientific discovery, not by replacing established methods but by extending what researchers can probe, predict, and design. In this roadmap we provide a forward-looking view of AI-enabled science across biology, chemistry, climate science, mathematics, materials science, physics, self-driving laboratories and unconventional computing. Several shared themes emerge: the need for diverse and trustworthy data, transferable electronic-structure and interatomic models, AI systems integrated into end-to-end scientific workflows that connect simulations to experiments, and generative systems grounded in synthesisability rather than purely idealised phases. Across domains, we highlight how large foundation models, active learning and self-driving laboratories can close loops between prediction and validation while maintaining reproducibility and physical interpretability. Taken together, these perspectives outline where AI-enabled science stands today, identify bottlenecks in data, methods and infrastructure, and chart concrete directions for building AI systems that are not only more powerful but also more transparent and capable of accelerating discovery in complex real-world environments.

2.9 | CR | Dec 5, 2022 | Code

Encrypted machine learning of molecular quantum properties

Jan Weinreich, Guido Falk von Rudorff, O. Anatole von Lilienfeld

Large machine learning models with improved predictions have become widely available in the chemical sciences. Unfortunately, these models do not protect the privacy necessary within commercial settings, prohibiting the use of potentially extremely valuable data by others. Encrypting the prediction process can solve this problem through double-blind model evaluation, and it prohibits the extraction of training or query data. However, contemporary ML models based on fully homomorphic encryption or federated learning are either too expensive for practical use or have to trade higher speed for weaker security. We have implemented secure and computationally feasible encrypted machine learning models using oblivious transfer, enabling secure predictions of molecular quantum properties across chemical compound space. However, we find that encrypted predictions using kernel ridge regression models are a million times more expensive than without encryption. This demonstrates a dire need for a compact machine learning model architecture, including molecular representation and kernel matrix size, that minimizes model evaluation costs.

1.2 | CHEM-PH | May 8, 2024 | Code

Data-Error Scaling Laws in Machine Learning on Combinatorial Mutation-prone Sets: Proteins and Small Molecules

Vanni Doffini, O. Anatole von Lilienfeld, Michael A. Nash

We investigate trends in the data-error scaling laws of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computational and experimental training data. Our synthetic datasets comprised i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy-atom structural graphs, while the experimental dataset consisted of a full deep mutational scan of the binding protein GB1. In contrast to typical data-error scaling laws, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. number of mutations) enclosed in the training set. We show that during training on this class of problems, the predictions were clustered by the ML models employed in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and introduce the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces such as chemical properties or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.
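For orientation, the "typical" data-error scaling law that this abstract contrasts against is commonly written as a power law, error ≈ a·N^(-b), which appears as a straight line on a log-log learning curve. The sketch below fits a and b by least squares in log space. The helper name fit_power_law and the synthetic numbers are illustrative assumptions, not code or data from the paper.

```python
import math

def fit_power_law(train_sizes, errors):
    """Least-squares fit of errors ~ a * N**(-b) in log-log space.

    Returns (a, b). Hypothetical helper for illustration only.
    """
    if len(train_sizes) != len(errors) or len(train_sizes) < 2:
        raise ValueError("need at least two (size, error) pairs of equal length")
    xs = [math.log(n) for n in train_sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("training set sizes must not all be identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # log(error) = log(a) - b * log(N), so a = exp(intercept) and b = -slope
    return math.exp(intercept), -slope

# Synthetic learning curve constructed to follow error = 2 * N**(-0.5).
sizes = [10, 100, 1000, 10000]
errs = [2.0 * n ** -0.5 for n in sizes]
a, b = fit_power_law(sizes, errs)
```

A learning curve with the abrupt drops described in the abstract would be expected to deviate from this single straight-line fit, which is one simple way to spot such behaviour.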

4.5 | ML | Oct 10, 2025

Gradient-Guided Furthest Point Sampling for Robust Training Set Selection

Morris Trestman, Stefan Gugler, Felix A. Faber et al.

Smart training set selection procedures enable the reduction of data needs and improve predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of configurational spaces of molecules. Numerical evidence is presented for a toy system (Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Compared to FPS and uniform sampling, our numerical results indicate superior data efficiency and robustness when using GGFPS. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to twofold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinski-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
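As a rough illustration of the idea, the sketch below implements plain greedy FPS and adds an optional per-point weight that multiplies the distance score. Using gradient norms as those weights is only one plausible way to make the selection gradient-aware; the abstract does not give the actual GGFPS scoring rule, so the weighting here, the function names, and the small grid are assumptions. The Styblinski-Tang gradient follows from its standard definition f(x) = 0.5 · Σ (x_i⁴ − 16 x_i² + 5 x_i).

```python
import math

def styblinski_tang_grad_norm(x):
    """Euclidean norm of the gradient of f(x) = 0.5 * sum(x_i**4 - 16*x_i**2 + 5*x_i)."""
    grad = [0.5 * (4.0 * xi ** 3 - 32.0 * xi + 5.0) for xi in x]
    return math.sqrt(sum(g * g for g in grad))

def furthest_point_sampling(points, n_select, weights=None, start=0):
    """Greedy FPS: repeatedly pick the unselected point with the largest
    (optionally weighted) distance to its nearest already-selected point.

    weights=None gives plain FPS. Passing per-point weights is an
    ASSUMED stand-in for gradient guidance, not the paper's formula.
    """
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")
    if weights is None:
        weights = [1.0] * len(points)
    selected = [start]
    chosen = {start}
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        candidates = [i for i in range(len(points)) if i not in chosen]
        best = max(candidates, key=lambda i: weights[i] * min_dist[i])
        selected.append(best)
        chosen.add(best)
        for i, p in enumerate(points):
            d = math.dist(p, points[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Illustrative 2-D grid on [-4, 4]^2 with gradient-norm weights.
grid = [(-4.0 + i, -4.0 + j) for i in range(9) for j in range(9)]
grad_weights = [styblinski_tang_grad_norm(p) for p in grid]
plain_subset = furthest_point_sampling(grid, 10)
weighted_subset = furthest_point_sampling(grid, 10, weights=grad_weights)
```

Note that weighting by the gradient norm, as done here, favours steep regions; whether the published method up-weights or down-weights high-force configurations is not stated in the abstract, so the direction of the weighting should be treated as a free design choice in this sketch.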

2.3 | CHEM-PH | Jan 23, 2017

Constant Size Molecular Descriptors For Use With Machine Learning

Christopher R. Collins, Geoffrey J. Gordon, O. Anatole von Lilienfeld et al.

A set of molecular descriptors whose length is independent of molecular size is developed for machine learning models that target thermodynamic and electronic properties of molecules. These features are evaluated by monitoring performance of kernel ridge regression models on well-studied data sets of small organic molecules. The features include connectivity counts, which require only the bonding pattern of the molecule, and encoded distances, which summarize distances between both bonded and non-bonded atoms and so require the full molecular geometry. In addition to having constant size, these features summarize information regarding the local environment of atoms and bonds, such that models can take advantage of similarities resulting from the presence of similar chemical fragments across molecules. Combining these two types of features leads to models whose performance is comparable to or better than the current state of the art. The features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules.

5.1 | STR-EL | Jun 29, 2015

Machine learning for many-body physics: efficient solution of dynamical mean-field theory

Louis-François Arsenault, O. Anatole von Lilienfeld, Andrew J. Millis

Machine learning methods for solving the equations of dynamical mean-field theory are developed. The method is demonstrated on the three-dimensional Hubbard model. The key technical issues are defining a mapping of an input function to an output function, and distinguishing metallic from insulating solutions. Both metallic and Mott insulator solutions can be predicted. The validity of the machine learning scheme is assessed by comparing predictions of full correlation functions, of quasi-particle weight and particle density to values directly computed. The results indicate that with modest further development, a machine learning approach may be an attractive, computationally efficient option for real materials predictions for strongly correlated systems.

11.3 | STR-EL | Aug 5, 2014

Machine learning for many-body physics: The case of the Anderson impurity model

Louis-François Arsenault, Alejandro Lopez-Bezanilla, O. Anatole von Lilienfeld et al.

Machine learning methods are applied to finding the Green's function of the Anderson impurity model, a basic model system of quantum many-body condensed-matter physics. Different methods of parametrizing the Green's function are investigated; a representation in terms of Legendre polynomials is found to be superior due to its limited number of coefficients and its applicability to state-of-the-art methods of solution. The dependence of the errors on the size of the training set is determined. The results indicate that a machine learning approach to dynamical mean-field theory may be feasible.
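To make the notion of a compact Legendre representation concrete, the sketch below expands a smooth function on [-1, 1] in Legendre polynomials and reconstructs it from a handful of coefficients. It is a generic illustration: the abstract does not state the paper's domain mapping or normalization for the Green's function, so the interval, the toy function, and all helper names are assumptions.

```python
def legendre_values(x, l_max):
    """Return [P_0(x), ..., P_l_max(x)] using Bonnet's recursion."""
    values = [1.0]
    if l_max >= 1:
        values.append(x)
    for l in range(1, l_max):
        # (l + 1) P_{l+1}(x) = (2l + 1) x P_l(x) - l P_{l-1}(x)
        values.append(((2 * l + 1) * x * values[l] - l * values[l - 1]) / (l + 1))
    return values

def legendre_coefficients(f, l_max, n_grid=2000):
    """Midpoint-rule estimate of c_l = (2l + 1)/2 * integral_{-1}^{1} f(x) P_l(x) dx."""
    h = 2.0 / n_grid
    sums = [0.0] * (l_max + 1)
    for k in range(n_grid):
        x = -1.0 + (k + 0.5) * h
        fx = f(x)
        for l, p in enumerate(legendre_values(x, l_max)):
            sums[l] += fx * p * h
    return [(2 * l + 1) / 2.0 * s for l, s in enumerate(sums)]

def legendre_reconstruct(coeffs, x):
    """Evaluate sum_l c_l P_l(x) from a coefficient list."""
    polys = legendre_values(x, len(coeffs) - 1)
    return sum(c * p for c, p in zip(coeffs, polys))

# Toy target: x**2 = (1/3) P_0(x) + (2/3) P_2(x), so only two coefficients are nonzero.
coeffs = legendre_coefficients(lambda x: x * x, l_max=4)
approx = legendre_reconstruct(coeffs, 0.3)
```

In a learning setting, a short coefficient vector like coeffs would be the regression target in place of the function sampled on a dense grid, which is the sense in which a limited number of coefficients is an advantage.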