LGQM | Jun 9, 2025

The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning

arXiv:2506.07619v1 | 2 citations | h-index: 44 | Has Code
Originality: Synthesis-oriented
AI Analysis

This paper gives the machine learning community a new benchmark for solvent selection in chemistry. The contribution is incremental, since it centers on dataset creation rather than methodological innovation.

The authors tackled the lack of accessible chemical datasets for machine learning by introducing a novel transient flow dataset for yield prediction, covering over 1200 process conditions, and demonstrated its use in benchmarking various algorithms for solvent selection.

Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community as they tend to require cleaning, thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allows us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.

Code Implementations: 1 repo
