SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models
This work addresses the need for green solvent replacement in the chemical industry by providing a more effective solvent representation, though the contribution is incremental because it builds on existing transformer and data-driven methods.
Generic chemical representations lack physical context for solvents. To address this, the authors propose SoDaDE, a solvent-specific data-driven embedding produced by a small transformer model, which outperformed previous representations at predicting yields on a recently published dataset.
Computational representations have become crucial in unlocking the recent growth of machine learning algorithms for chemistry. Representations were initially hand-designed, but machine learning has shown that meaningful representations can be learnt from data. Because chemical datasets are limited, learnt representations are typically generic: they are trained on broad datasets that contain shallow information on many different molecule types. For example, generic fingerprints lack physical context specific to solvents. At the same time, the use of harmful solvents is a leading climate-related issue in the chemical industry, and there is a surge of interest in green solvent replacement. To empower this research, we propose a new solvent representation scheme, Solvent Data-Driven Embeddings (SoDaDE). SoDaDE uses a small transformer model and a solvent property dataset to create a fingerprint for solvents. To showcase its effectiveness, we use SoDaDE to predict yields on a recently published dataset, outperforming previous representations. Through this paper, we demonstrate that data-driven fingerprints can be made with small datasets, and we set up a workflow that can be explored for other applications.
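To make the idea of a property-driven fingerprint concrete, the following is a minimal sketch of how a set of solvent properties could be mapped to a fixed-length vector with one self-attention layer. It is not the SoDaDE architecture or training procedure, which the abstract does not specify. The property names, the embedding width `D_MODEL`, the single untrained attention layer with random weights, the mean pooling, and the input values are all assumptions chosen for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained weights are reproducible

D_MODEL = 8  # hypothetical embedding width
# Hypothetical property names; the real property dataset is not listed in the abstract.
PROPERTIES = ["boiling_point", "density", "dielectric_constant"]


def rand_matrix(rows, cols):
    """Random (untrained) weight matrix; a real model would learn these."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# One identity embedding per property, plus a shared projection for its value.
PROP_EMB = {name: [random.gauss(0.0, 0.1) for _ in range(D_MODEL)] for name in PROPERTIES}
VALUE_PROJ = [random.gauss(0.0, 0.1) for _ in range(D_MODEL)]
W_Q, W_K, W_V = (rand_matrix(D_MODEL, D_MODEL) for _ in range(3))


def embed_tokens(props):
    """Build one token per property: identity embedding + value * projection."""
    return [
        [e + props[name] * w for e, w in zip(PROP_EMB[name], VALUE_PROJ)]
        for name in PROPERTIES
    ]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over the property tokens."""
    q, k, v = matmul(tokens, W_Q), matmul(tokens, W_K), matmul(tokens, W_V)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(D_MODEL) for s in row]) for row in scores]
    return matmul(weights, v)


def fingerprint(props):
    """Mean-pool the attended tokens into a fixed-length solvent vector."""
    attended = self_attention(embed_tokens(props))
    return [sum(col) / len(attended) for col in zip(*attended)]


# Placeholder standardized property values, not real measurements.
solvent_a = {"boiling_point": 0.1, "density": -0.4, "dielectric_constant": 0.9}
solvent_b = {"boiling_point": -1.2, "density": 0.7, "dielectric_constant": -0.3}
fp_a = fingerprint(solvent_a)
fp_b = fingerprint(solvent_b)
```

In a workflow like the one the abstract describes, vectors such as `fp_a` would be passed as input features to a separate downstream model, for example a yield regressor. The weights here are random, so the vectors carry no chemical meaning until the encoder is trained on property data.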