LG NE QM | Jul 26, 2024

Small Molecule Optimization with Large Language Models

Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan

arXiv:2407.18897v1 | 14.29 citations | h-index: 21 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses molecular optimization for drug design. It is an incremental advance: it applies existing language model techniques to a specific domain, contributing a new corpus and a new algorithm.

The authors tackled the problem of generative molecular drug design by developing Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules, and a novel optimization algorithm that combines genetic algorithms, rejection sampling, and prompt optimization. The approach achieved state-of-the-art performance on multiple benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods.

Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
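The abstract gives only a high-level description of the optimization algorithm, so the following is a minimal illustrative sketch of the general pattern it names, not the authors' implementation. It assumes a toy string representation instead of real molecules, a random mutation function (`propose`, hypothetical) standing in for language-model sampling, and a hypothetical `optimize` helper. The pool of top-scoring candidates plays the role of the prompt, duplicate proposals are rejected before they reach the oracle, and oracle calls are capped by a fixed budget.

```python
import random
from typing import Callable, Dict, List, Tuple

ALPHABET = "CNO"  # toy symbols only; these strings are not real molecules


def propose(pool: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for sampling a language model prompted with the pool."""
    parent = rng.choice(pool)
    pos = rng.randrange(len(parent))
    return parent[:pos] + rng.choice(ALPHABET) + parent[pos + 1:]


def optimize(
    oracle: Callable[[str], float],
    seeds: List[str],
    budget: int,
    pool_size: int = 4,
    max_proposals: int = 10_000,
    seed: int = 0,
) -> Tuple[str, float, int]:
    """Return (best candidate, its score, number of oracle calls used)."""
    rng = random.Random(seed)
    scored: Dict[str, float] = {}
    calls = 0

    # Scoring the seeds also consumes oracle budget.
    for s in seeds:
        if s not in scored and calls < budget:
            scored[s] = oracle(s)
            calls += 1

    for _ in range(max_proposals):  # guard against endless rejection
        if calls >= budget:
            break
        # Genetic-style selection: keep only the current top scorers as the pool.
        pool = sorted(scored, key=scored.get, reverse=True)[:pool_size]
        candidate = propose(pool, rng)
        # Rejection step: never spend an oracle call on an already-scored candidate.
        if candidate in scored:
            continue
        scored[candidate] = oracle(candidate)
        calls += 1

    best = max(scored, key=scored.get)
    return best, scored[best], calls


if __name__ == "__main__":
    # Toy black-box objective: count of "N" symbols in the string.
    best, score, calls = optimize(lambda s: s.count("N"), ["CCCCCCCC"], budget=40)
    print(best, score, calls)
```

In a realistic setting the oracle would be an expensive property evaluator and `propose` would be a model call, which is why the sketch filters candidates before scoring them rather than after.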
