Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights
This work addresses active learning under self-induced weighting, a setting relevant to computational chemistry and drug discovery, and offers a Gaussian Process-based acquisition function with theoretical guarantees on prediction error.
The paper tackles active learning for Gaussian Process regression under an unknown Boltzmann distribution induced by the function itself, a problem arising in computational chemistry. The proposed AB-SID-iVAR acquisition function achieves consistent improvements over existing methods on synthetic benchmarks and on real-world tasks such as potential energy surface modeling and drug discovery.
We consider the active learning problem where the goal is to learn an unknown function with low prediction error under an unknown Boltzmann distribution induced by the function itself. This self-induced weighting arises naturally in problems such as potential energy surface (PES) modeling in computational chemistry, yet poses unique challenges because the target distribution is unknown and its partition function is intractable. We propose \texttt{AB-SID-iVAR}, a Gaussian Process-based acquisition function that approximates the intractable Bayesian target distribution in closed form while avoiding partition function estimation, and that is applicable to both discrete and continuous input domains. We also analyze a Thompson sampling alternative (\texttt{TS-SID-iVAR}) as a higher-variance Monte Carlo variant. Despite the unknown target, we establish under mild conditions that the terminal prediction error vanishes with high probability, and we provide a tighter average-case guarantee. We demonstrate consistent improvements over existing approaches in this setting on synthetic benchmarks and on real-world PES modeling and drug discovery tasks.
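To make the setting concrete, the following is a minimal sketch of a Boltzmann-weighted integrated variance reduction score on a discrete 1-D grid. It is an illustration under stated assumptions, not the paper's \texttt{AB-SID-iVAR} definition: the RBF kernel, the inverse temperature `beta`, the toy objective, and all function names are hypothetical choices made here. The only facts it relies on are the standard GP posterior equations and the log-normal identity E[exp(-beta f(x))] = exp(-beta mu(x) + beta^2 sigma^2(x) / 2), which gives an unnormalized weight in closed form. Because the score is only used inside an argmax, the weights never need to be normalized, so no partition function is computed.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    # Squared-exponential kernel on 1-D inputs (an assumed choice).
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_grid, noise=1e-4):
    # Standard GP regression posterior mean and covariance on the grid.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_grid)
    Kss = rbf_kernel(x_grid, x_grid)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    V = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, Kss - V.T @ V

def weighted_ivar_scores(mean, cov, beta=1.0, noise=1e-4):
    var = np.clip(np.diag(cov), 0.0, None)
    # Unnormalized weight: posterior expectation of exp(-beta * f(x)).
    log_w = -beta * mean + 0.5 * beta**2 * var
    w = np.exp(log_w - log_w.max())  # rescaling leaves the argmax unchanged
    # reduction[i, j]: drop in posterior variance at grid point j
    # after one noisy observation at candidate i.
    reduction = cov**2 / (var[:, None] + noise)
    return reduction @ w

# Toy usage: the objective and training points are made up for illustration.
f = lambda x: np.sin(3.0 * x) + x**2
x_grid = np.linspace(-1.0, 1.0, 50)
x_train = np.array([-0.8, 0.1, 0.9])
mean, cov = gp_posterior(x_train, f(x_train), x_grid)
scores = weighted_ivar_scores(mean, cov, beta=2.0)
next_x = x_grid[np.argmax(scores)]
```

In this sketch, larger `beta` concentrates the weights on regions where the posterior suggests low function values, so queries shift from uniform variance reduction toward the low-energy regions that dominate the Boltzmann-weighted error.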