Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory
The method provides a general tool for data-driven atomistic modeling that combines machine learning, simulations, and physical explainability. The contribution is incremental, however, because it applies existing information-theoretic concepts to a specific domain.
The authors address the problem of quantifying information in atomistic machine learning without relying on models. They introduce a theoretical framework in which information entropy explains known heuristics, such as training set sizes and dataset optimality, and they propose a model-free uncertainty quantification method that reliably predicts epistemic uncertainty and detects outliers, including rare events such as nucleation.
An accurate description of information is relevant for a range of problems in atomistic machine learning (ML), such as crafting training sets, performing uncertainty quantification (UQ), or extracting physical insights from large datasets. However, atomistic ML often relies on unsupervised learning or model predictions to analyze the information content of simulation or training data. Here, we introduce a theoretical framework that provides a rigorous, model-free tool to quantify information content in atomistic simulations. We demonstrate that the information entropy of a distribution of atom-centered environments explains known heuristics in ML potential development, from training set sizes to dataset optimality. Using this tool, we propose a model-free UQ method that reliably predicts epistemic uncertainty and detects out-of-distribution samples, including rare events such as nucleation. This method provides a general tool for data-driven atomistic modeling and combines efforts in ML, simulations, and physical explainability.
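
As a rough illustration of what an entropy over a distribution of atom-centered environments could look like in practice, the sketch below applies a Gaussian kernel density estimate to descriptor vectors. The kernel choice, the `bandwidth` parameter, the function names, and the use of plain NumPy arrays as stand-ins for environment descriptors are all assumptions made for illustration; the abstract does not specify the estimator, so this should not be read as the authors' implementation.

```python
import numpy as np


def log_kernel_sum(y, x, bandwidth=1.0):
    """Return log(sum_j K(y_i, x_j)) for each row y_i, with a Gaussian kernel K.

    y has shape (m, d) and x has shape (n, d). The sum is evaluated with a
    log-sum-exp shift so that very distant points do not underflow to log(0).
    """
    sq_dist = ((y[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    a = -sq_dist / (2.0 * bandwidth ** 2)
    shift = a.max(axis=1, keepdims=True)
    return shift[:, 0] + np.log(np.exp(a - shift).sum(axis=1))


def dataset_entropy(x, bandwidth=1.0):
    """Entropy estimate in nats: H = -mean_i log( (1/n) sum_j K(x_i, x_j) )."""
    n = x.shape[0]
    return float(-np.mean(log_kernel_sum(x, x, bandwidth) - np.log(n)))


def novelty(y, x, bandwidth=1.0):
    """Model-free novelty score -log(sum_j K(y, x_j)) of each row of y w.r.t. x.

    Larger values mean the environment is farther from everything in x.
    """
    return -log_kernel_sum(y, x, bandwidth)


# Hypothetical toy "descriptors": five identical environments versus
# five environments that are far apart relative to the bandwidth.
redundant = np.zeros((5, 3))
diverse = np.arange(5, dtype=float)[:, None] * 100.0 * np.ones((1, 3))

h_redundant = dataset_entropy(redundant)
h_diverse = dataset_entropy(diverse)

# Score one environment already in the dataset and one far from all of it.
score_seen = novelty(diverse[:1], diverse)
score_unseen = novelty(np.full((1, 3), 50.0), diverse)
```

Under these assumptions, a dataset of identical environments has zero entropy, while n well-separated environments approach the upper bound log(n). This is one way to see how an entropy of this kind could capture redundancy in a training set. The novelty score is non-positive for environments already covered by the dataset and grows as a sample moves away from it, which is the behavior one would want from an out-of-distribution detector.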