Learning Collective Variables with Synthetic Data Augmentation through Physics-Inspired Geodesic Interpolation
This work addresses a bottleneck in molecular dynamics simulations for researchers studying rare events like protein folding, offering an incremental improvement in data augmentation methods.
The paper tackles the problem of learning collective variables for rare events in molecular dynamics, such as protein folding. It proposes a simulation-free data augmentation strategy that uses physics-inspired geodesic interpolation to generate synthetic transition data, improving sampling efficiency without requiring true transition state samples.
In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but is often hindered by the lack of information about the particular event, e.g., the transition from the unfolded to the folded conformation. We propose a simulation-free data augmentation strategy that uses physics-inspired metrics to generate geodesic interpolations resembling protein folding transitions, thereby improving sampling efficiency without requiring true transition state samples. The resulting synthetic data can be used to improve the accuracy of classifier-based methods. Alternatively, a regression-based learning scheme for CV models can be adopted by using the interpolation progress parameter as the regression target.
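
To make the augmentation idea concrete, the following is a minimal sketch, not the authors' method. The abstract does not specify the physics-inspired metric, so the sketch swaps in a much simpler stand-in: straight-line interpolation in a space of pairwise interatomic distances. The function names (`pairwise_distances`, `interpolate_features`), the toy random coordinates, and the number of interpolation points are all assumptions made for illustration. The point it shows is only the data layout: each synthetic sample comes with a progress value t in [0, 1] that could serve as a regression target for a CV model.

```python
import numpy as np

def pairwise_distances(x):
    """Return the flattened upper-triangular interatomic distances
    for coordinates x of shape (n_atoms, 3)."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(x.shape[0], k=1)
    return d[iu]

def interpolate_features(x_start, x_end, n_points=11):
    """Interpolate between two conformations in distance-feature space.

    ASSUMPTION: a straight line in pairwise-distance space is used as a
    simple stand-in for a geodesic under a physics-inspired metric.
    Returns (features, t), where t is the interpolation progress.
    """
    f0 = pairwise_distances(x_start)
    f1 = pairwise_distances(x_end)
    t = np.linspace(0.0, 1.0, n_points)
    features = (1.0 - t)[:, None] * f0 + t[:, None] * f1
    return features, t

# Toy example: random coordinates standing in for two end states
# (hypothetical data, not real protein conformations).
rng = np.random.default_rng(0)
unfolded = rng.normal(size=(5, 3))
folded = rng.normal(size=(5, 3))

features, progress = interpolate_features(unfolded, folded)
# features: one row per synthetic sample; progress: regression targets in [0, 1]
```

In this layout, a regression-based CV model would be fit to map each row of `features` to the matching entry of `progress`, while a classifier-based scheme could instead treat the interpolated rows as an additional class between the two end states.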