LG · Jan 15, 2025

VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science

Oxford
arXiv:2501.08995v2 · 2 citations · h-index: 29
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of data available to pharmaceutical scientists and so enables more data-driven approaches. It is an incremental advance, however, because it adapts existing generative models to a specific domain.

The paper tackles data scarcity in pharmaceutical research with VECT-GAN, a generative model that augments small, noisy datasets before regression models are trained on them. Across six pharmaceutical datasets, this augmentation yields consistent and significant performance improvements over other state-of-the-art tabular generative models. The authors also apply the pipeline to develop novel polymers with medically desirable mucoadhesive properties.

Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While machine learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT-GAN pre-trained on ChEMBL, available as a pip package.
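The augment-then-regress pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `standin_generator` is a hypothetical placeholder that simply jitters resampled training rows and is not VECT-GAN, and the function names, the one-feature dataset, and the least-squares regressor are not taken from the paper or its pip package. The sketch only shows the order of operations: synthetic rows are generated from the training data and then combined with it before the regression model is fitted.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standin_generator(rows, n_synthetic, noise, rng):
    """Hypothetical placeholder for a trained tabular generator.

    It resamples real rows and adds Gaussian jitter. This is NOT VECT-GAN;
    it only stands in for "something that emits synthetic rows".
    """
    synthetic = []
    for _ in range(n_synthetic):
        x, y = rng.choice(rows)
        synthetic.append((x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise)))
    return synthetic


def augment_then_fit(train_rows, n_synthetic=200, noise=0.1, seed=0):
    """Augment the training rows first, then fit the regressor on the union."""
    rng = random.Random(seed)
    synthetic = standin_generator(train_rows, n_synthetic, noise, rng)
    combined = list(train_rows) + synthetic
    xs, ys = zip(*combined)
    return fit_line(xs, ys), len(combined)


# A deliberately tiny, noisy dataset (illustrative values, not from the paper).
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
(slope, intercept), n_rows = augment_then_fit(train, n_synthetic=50)
print(f"fitted on {n_rows} rows: y = {slope:.2f} * x + {intercept:.2f}")
```

In a real evaluation the generator would be fitted only on the training split, with the held-out test rows kept out of both generation and fitting, so that any measured gain is not an artefact of leakage.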

Code Implementations: 1 repo
