CHEM-PH LG COMP-PH | Oct 15, 2024

Investigating Data Hierarchies in Multifidelity Machine Learning for Excitation Energies

arXiv:2410.11392v23.35 citations | h-index: 14 | Has Code | J Chem Theory Comput

Originality: Incremental advance

AI Analysis

This work targets the cost of quantum chemistry calculations. It is an incremental advance: it builds on existing multifidelity methods and adds specific optimizations.

The study optimizes multifidelity machine learning for predicting vertical excitation energies. It investigates the effect of the scaling factor and introduces compute-time-informed scaling factors, reaching high accuracy with only 2 training samples at the target fidelity when a larger number of lower-fidelity samples is used.

Recent progress in machine learning (ML) has made high-accuracy quantum chemistry (QC) calculations more accessible. Of particular interest are multifidelity machine learning (MFML) methods where training data from differing accuracies or fidelities are used. These methods usually employ a fixed scaling factor, $γ$, to relate the number of training samples across different fidelities, which reflects the cost and assumed sparsity of the data. This study investigates the impact of modifying $γ$ on model efficiency and accuracy for the prediction of vertical excitation energies using the QeMFi benchmark dataset. Further, this work introduces QC compute time informed scaling factors, denoted as $θ$, that vary based on QC compute times at different fidelities. A novel error metric, error contours of MFML, is proposed to provide a comprehensive view of model error contributions from each fidelity. The results indicate that high model accuracy can be achieved with just 2 training samples at the target fidelity when a larger number of samples from lower fidelities are used. This is further illustrated through a novel concept, the $Γ$-curve, which compares model error against the time-cost of generating training samples, demonstrating that multifidelity models can achieve high accuracy while minimizing training data costs.
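
The role of the scaling factor $γ$ and the time-cost axis of the $Γ$-curve can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes geometric scaling, where each step down in fidelity multiplies the training-set size by $γ$, and it uses five fidelities and made-up per-sample compute times purely for illustration. The function names `samples_per_fidelity` and `training_time_cost` are hypothetical.

```python
def samples_per_fidelity(n_target, gamma, n_fidelities):
    """Training-set sizes ordered from lowest to highest (target) fidelity.

    Assumption: geometric scaling, i.e. a fidelity that sits k steps below
    the target gets n_target * gamma**k samples.
    """
    return [n_target * gamma ** (n_fidelities - 1 - f) for f in range(n_fidelities)]


def training_time_cost(sizes, times_per_sample):
    """Total time to generate the training data: sum over fidelities of N_f * t_f."""
    return sum(n * t for n, t in zip(sizes, times_per_sample))


# 2 samples at the target fidelity, gamma = 2, five fidelities (illustrative).
sizes = samples_per_fidelity(n_target=2, gamma=2, n_fidelities=5)

# Hypothetical per-sample QC compute times (arbitrary units), lowest fidelity first.
times = [1.0, 2.0, 4.0, 8.0, 16.0]

cost = training_time_cost(sizes, times)
print(sizes, cost)
```

Pairing a cost value like this with the corresponding model error, for several choices of scaling factor, gives the kind of error-versus-time-cost comparison that the abstract attributes to the $Γ$-curve.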

