LG AI, Jun 13, 2025

Robust Molecular Property Prediction via Densifying Scarce Labeled Data

Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

arXiv:2506.11877v37.11 citations, h-index: 4, Has Code

Originality: Incremental advance

AI Analysis

The work addresses a critical issue in drug discovery, where models often fail on novel compounds, but it appears incremental because it builds on existing methods for handling distribution shift.

The paper tackles poor generalization in molecular property prediction caused by covariate shift and scarce labeled data. It proposes a bilevel optimization approach that uses unlabeled data to interpolate between in-distribution and out-of-distribution data, and reports significant performance gains on real-world datasets.

A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data, which stems from the onerous and costly nature of experimental validation, exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.
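
For readers unfamiliar with the term, a bilevel optimization problem nests one optimization inside another: an outer objective is minimized over variables that also shape an inner problem. The generic form is sketched below. The symbols $\theta$, $\phi$, $\mathcal{L}_{\text{inner}}$, and $\mathcal{L}_{\text{outer}}$ are placeholder notation assumed here for illustration and are not taken from the paper; the abstract does not say which parameters or losses occupy each level.

```latex
% Generic bilevel problem (placeholder notation, not the paper's own)
\begin{aligned}
\min_{\phi} \quad & \mathcal{L}_{\text{outer}}\bigl(\theta^{*}(\phi),\, \phi\bigr) \\
\text{s.t.} \quad & \theta^{*}(\phi) \in \arg\min_{\theta} \; \mathcal{L}_{\text{inner}}(\theta,\, \phi)
\end{aligned}
```

In this generic setup the outer variables $\phi$ are adjusted according to how well the inner solution $\theta^{*}(\phi)$ performs on the outer objective. How the interpolation between ID and OOD data enters either level is specified in the paper itself, not here.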
