Ensemble Learning of Coarse-Grained Molecular Dynamics Force Fields with a Kernel Approach
This work addresses the prohibitive memory cost of kernel-based coarse-graining for molecular simulations. It offers a data-efficient method with potential applications in computational chemistry and biophysics, although the contribution is incremental because it builds on existing gradient-domain machine learning techniques.
The authors learn coarse-grained molecular dynamics force fields from all-atom simulation data by introducing an ensemble learning and stratified sampling scheme within gradient-domain machine learning. The scheme achieves smaller free energy errors than neural networks with small training sets and comparable accuracy with larger sets.
Gradient-domain machine learning (GDML) is an accurate and efficient approach to learning a molecular potential and its associated force field, based on the kernel ridge regression algorithm. Here, we demonstrate its application to learning an effective coarse-grained (CG) model from all-atom simulation data in a sample-efficient manner. The coarse-grained force field is learned by following the thermodynamic consistency principle, here by minimizing the error between the predicted coarse-grained force and the all-atom mean force in the coarse-grained coordinates. Solving this problem with GDML directly is infeasible because coarse-graining requires averaging over many training data points, which leads to impractical memory requirements for storing the kernel matrices. In this work, we propose a data-efficient and memory-saving alternative: a two-layer training scheme, based on ensemble learning and stratified sampling, that enables GDML to learn an effective coarse-grained model. We illustrate our method on a simple biomolecular system, alanine dipeptide, by reconstructing the free energy landscape of a coarse-grained variant of this molecule. Our GDML training scheme yields a smaller free energy error than neural networks when the training set is small, and comparably high accuracy when the training set is sufficiently large.
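To make the ensemble idea concrete, the following Python sketch trains several small kernel ridge regression models on stratified batches of noisy force samples and averages their predictions to estimate a mean force. It is an illustration under strong assumptions, not the authors' implementation: it substitutes a plain Gaussian kernel fitted directly to forces for the GDML kernel, uses a one-dimensional double-well toy system instead of alanine dipeptide, and does not reproduce the two-layer scheme. All function names, the binning strategy, and the hyperparameter values are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between two 1D coordinate arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / sigma) ** 2)


def stratified_batch(x, n_bins, per_bin, rng):
    """Draw up to `per_bin` sample indices from each of `n_bins` equal-width strata."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Using only the interior edges yields stratum ids in 0 .. n_bins - 1.
    bin_ids = np.digitize(x, edges[1:-1])
    picks = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_ids == b)
        if members.size == 0:
            continue  # skip empty strata
        k = min(per_bin, members.size)
        picks.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(picks)


def fit_krr(x, f, sigma, lam):
    """Solve (K + lam * I) alpha = f for one small batch."""
    K = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x)), f)


def train_ensemble(x, f, n_models, n_bins, per_bin, sigma, lam, seed=0):
    """Fit one small model per stratified batch; each kernel matrix stays small."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = stratified_batch(x, n_bins, per_bin, rng)
        models.append((x[idx], fit_krr(x[idx], f[idx], sigma, lam)))
    return models


def ensemble_predict(models, x_query, sigma):
    """Average the force predictions of all ensemble members."""
    preds = [gaussian_kernel(x_query, xb, sigma) @ alpha for xb, alpha in models]
    return np.mean(preds, axis=0)


def demo():
    """Toy check: recover the mean force of U(x) = (x^2 - 1)^2 from noisy samples."""
    rng = np.random.default_rng(42)
    x = rng.uniform(-1.5, 1.5, size=5000)
    mean_force = -4.0 * x * (x**2 - 1.0)  # -dU/dx
    # Noisy instantaneous forces stand in for all-atom forces mapped to CG coordinates.
    f = mean_force + rng.normal(scale=2.0, size=x.size)
    models = train_ensemble(x, f, n_models=20, n_bins=15, per_bin=10,
                            sigma=0.3, lam=1.0)
    xq = np.linspace(-1.2, 1.2, 25)
    pred = ensemble_predict(models, xq, sigma=0.3)
    true = -4.0 * xq * (xq**2 - 1.0)
    return float(np.sqrt(np.mean((pred - true) ** 2)))


if __name__ == "__main__":
    print("RMSE of ensemble mean-force estimate:", demo())
```

In this sketch each ensemble member only ever builds a kernel matrix of at most `n_bins * per_bin` rows, which is the memory-saving aspect the abstract alludes to, while the averaging across members plays the role of averaging over many noisy force samples.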