Instructor-inspired Machine Learning for Robust Molecular Property Prediction
This work addresses the problem of limited labeled data for researchers in chemical and biological sciences, though it appears incremental as it builds on existing pseudo-labeling methods.
The authors tackle data sparsity in molecular property prediction with InstructMol, an instructive learning algorithm that measures the reliability of pseudo-labels so that the target model can leverage large-scale unlabeled data. They report high accuracy on several real-world molecular datasets and OOD benchmarks.
Machine learning is catalyzing a revolution in the chemical and biological sciences. However, its efficacy depends heavily on the availability of labeled data, and annotating biochemical data is extremely laborious. To surmount this data sparsity challenge, we present an instructive learning algorithm named InstructMol that measures the reliability of pseudo-labels and helps the target model leverage large-scale unlabeled data. InstructMol does not require transferring knowledge between multiple domains, which avoids the potential gap between the pretraining and fine-tuning stages. We demonstrate the high accuracy of InstructMol on several real-world molecular datasets and out-of-distribution (OOD) benchmarks. Code is available at https://github.com/smiles724/InstructMol.
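To make the idea of weighting pseudo-labels by their reliability concrete, here is a minimal sketch in plain Python. It is not the InstructMol objective, which the abstract does not specify. It assumes a generic scheme in which each pseudo-labeled sample contributes to a squared-error loss in proportion to a reliability score in [0, 1], while labeled samples keep full weight. The function name `reliability_weighted_loss` and all of its parameters are hypothetical.

```python
from typing import Sequence


def reliability_weighted_loss(
    preds: Sequence[float],
    targets: Sequence[float],
    is_pseudo: Sequence[bool],
    reliability: Sequence[float],
) -> float:
    """Weighted mean squared error over labeled and pseudo-labeled samples.

    Assumption (not from the paper): labeled samples get weight 1.0 and
    pseudo-labeled samples get their reliability score, clipped to [0, 1].
    """
    if not (len(preds) == len(targets) == len(is_pseudo) == len(reliability)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    weight_sum = 0.0
    for p, t, pseudo, r in zip(preds, targets, is_pseudo, reliability):
        # True labels are fully trusted; pseudo-labels are down-weighted.
        w = min(max(r, 0.0), 1.0) if pseudo else 1.0
        total += w * (p - t) ** 2
        weight_sum += w

    # Guard against the degenerate case where every weight is zero.
    return total / weight_sum if weight_sum > 0 else 0.0


# One labeled sample and one pseudo-labeled sample with reliability 0.5.
loss = reliability_weighted_loss(
    preds=[1.0, 2.0],
    targets=[0.0, 0.0],
    is_pseudo=[False, True],
    reliability=[1.0, 0.5],
)
```

In this sketch, a pseudo-label with reliability 0 has no influence on the loss, and one with reliability 1 counts as much as a true label. How the reliability scores are produced is left open here.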