Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
This work addresses the challenge of accurate property prediction in materials science when data is sparse; the contribution is incremental, building on existing multitask learning techniques.
The paper tackles the poor performance of multitask learning models on sparse data sets with weakly correlated properties by fusing deep-learned embeddings from pretrained single-task models. The resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters, as demonstrated on a benchmark data set and a newly compiled sparse data set.
Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting the accuracy and applicability of these models. To improve predictions, techniques such as transfer learning and multitask learning have been used. The performance of multitask learning models depends on the strength of the underlying correlations between tasks and the completeness of the data set. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pretrained single-task models, resulting in a multitask model that inherits rich, property-specific representations. By reusing (rather than retraining) these embeddings, the resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules as well as a newly compiled sparse data set of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
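The fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frozen encoders are random stand-ins for pretrained single-task models, the inputs and labels are synthetic, and closed-form ridge-regression heads are swapped in for a trained neural multitask head. All names (`make_frozen_encoder`, `fuse_embeddings`, `fit_masked_heads`) are hypothetical. The sketch shows three things: embeddings are reused without updating the encoders, they are concatenated into one fused representation, and each property head is fitted only on the samples where that property is actually labeled.

```python
import numpy as np

def make_frozen_encoder(in_dim, emb_dim, seed):
    """Hypothetical stand-in for the embedding layer of a pretrained single-task model."""
    w = np.random.default_rng(seed).normal(size=(in_dim, emb_dim)) / np.sqrt(in_dim)
    return lambda x: np.tanh(x @ w)  # w is fixed and never updated below

def fuse_embeddings(encoders, x):
    """Concatenate the embeddings from all frozen encoders into one fused vector."""
    return np.concatenate([enc(x) for enc in encoders], axis=1)

def add_bias(z):
    """Append a constant column so each head can learn an offset."""
    return np.hstack([z, np.ones((z.shape[0], 1))])

def fit_masked_heads(z, y, mask, ridge=1e-3):
    """Fit one ridge-regression head per task using only rows where that label exists."""
    zb = add_bias(z)
    n_feat = zb.shape[1]
    weights = np.zeros((n_feat, y.shape[1]))
    for t in range(y.shape[1]):
        rows = mask[:, t]                      # samples with an observed label for task t
        a = zb[rows]
        weights[:, t] = np.linalg.solve(
            a.T @ a + ridge * np.eye(n_feat), a.T @ y[rows, t]
        )
    return weights

def predict(z, weights):
    return add_bias(z) @ weights

# Synthetic setup (assumption): 200 molecules, 16 input descriptors, 3 properties.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))
encoders = [make_frozen_encoder(16, 8, seed=s) for s in (1, 2, 3)]
z = fuse_embeddings(encoders, x)               # fused embedding, shape (200, 24)

# Synthetic labels that depend linearly on the fused embedding, plus small noise.
true_w = rng.normal(size=(z.shape[1], 3))
y = z @ true_w + 0.01 * rng.normal(size=(200, 3))

# Sparse labels: roughly 30% of entries are observed, the rest are missing (NaN).
mask = rng.random(y.shape) < 0.3
y[~mask] = np.nan

weights = fit_masked_heads(z, y, mask)         # the only fitted parameters
pred = predict(z, weights)
masked_mse = float(np.mean((pred[mask] - y[mask]) ** 2))
n_trainable = weights.size
```

In this sketch the only fitted quantities are the head weights (`n_trainable`), while the encoder weights stay untouched. Adding a further property would amount to appending another encoder's embedding and fitting one more head, which is one way to read the claim that the fused model can be extended with fewer trainable parameters.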