CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction
This work addresses the need for better predictive models in chemical product development, for example drug formulations and battery electrolytes, by providing a curated dataset and initial benchmarks. The contribution is incremental in that it builds on existing public datasets.
The paper tackles the problem of predicting properties for chemical mixtures, which is crucial for industry but underexplored in machine learning, by introducing CheMixHub, a benchmark with 11 tasks and approximately 500k data points from 7 datasets.
Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product in use is a mixture of chemicals. Although mixtures are a vital part of the industrial pipeline, the chemical mixture space remains relatively unexplored by the machine learning community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures, covering a corpus of 11 chemical mixture property prediction tasks, from drug delivery formulations to battery electrolytes, totalling approximately 500k data points gathered and curated from 7 publicly available datasets. CheMixHub introduces various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub
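The abstract does not say which splitting techniques CheMixHub provides. As an illustration of what a mixture-aware split can look like, the minimal sketch below holds out a subset of molecules and sends every mixture containing one of them to the test set, so the test set probes generalization to unseen components. The function name `leave_molecules_out_split`, its parameters, and the toy SMILES data are all hypothetical and are not taken from the CheMixHub code.

```python
import random
from typing import List, Sequence, Set, Tuple

# A mixture is represented here as a tuple of component identifiers
# (SMILES strings in this toy example). This representation is an assumption.
Mixture = Tuple[str, ...]


def leave_molecules_out_split(
    mixtures: Sequence[Mixture],
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[int], List[int], Set[str]]:
    """Hypothetical split: hold out a fraction of the unique molecules.

    Any mixture that contains at least one held-out molecule is assigned
    to the test set; all remaining mixtures go to the training set.
    Returns (train indices, test indices, held-out molecules).
    """
    # Sort before shuffling so the result depends only on the seed.
    molecules = sorted({mol for mix in mixtures for mol in mix})
    rng = random.Random(seed)
    rng.shuffle(molecules)

    n_holdout = max(1, int(len(molecules) * holdout_fraction))
    held_out = set(molecules[:n_holdout])

    train_idx: List[int] = []
    test_idx: List[int] = []
    for i, mix in enumerate(mixtures):
        if held_out & set(mix):
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx, held_out


# Toy data, made up for illustration only.
mixtures = [
    ("CCO", "O"),
    ("CCO", "CC(=O)O"),
    ("O", "CC(=O)O"),
    ("c1ccccc1", "CCO"),
    ("O", "CO"),
]
train_idx, test_idx, held_out = leave_molecules_out_split(
    mixtures, holdout_fraction=0.25, seed=0
)
```

Compared with a uniformly random split over mixtures, a split of this kind prevents a model from scoring well merely by memorizing components it has already seen in other mixtures.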