LG | ML | Nov 12, 2019

SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

arXiv:1911.04738v1 | 255 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of low-data molecular property prediction in drug discovery, offering a novel pre-training approach that improves generalization, though it is incremental as it adapts existing NLP methods to a specific domain.

The authors address the poor performance of rule-based molecular fingerprints in low-data drug discovery tasks by introducing SMILES Transformer, a pre-trained model that learns molecular fingerprints from SMILES sequences. In benchmarks on 10 datasets, it outperforms existing fingerprints and graph-based methods in small-data settings.

In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms in small-data settings where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.
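
To make the contrast in the abstract concrete, the following is a minimal sketch of what a rule-based mapping from a SMILES string to a sparse discrete vector can look like. It is a toy illustration only, not the paper's method and not a standard fingerprint such as ECFP: the tokenizer regex, the function names `tokenize` and `hashed_ngram_fingerprint`, and the parameters `n_bits` and `max_n` are all assumptions introduced here.

```python
import re
import zlib

# Hypothetical tokenizer (an assumption, not the paper's): splits a SMILES
# string into bracket atoms, two-letter halogens (Cl, Br), and single characters.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Return the list of SMILES tokens under the toy rules above."""
    return _TOKEN_RE.findall(smiles)


def hashed_ngram_fingerprint(smiles, n_bits=64, max_n=3):
    """Toy rule-based fingerprint: set one bit per hashed token n-gram.

    Each contiguous run of 1..max_n tokens is hashed with CRC32 and folded
    into a fixed-length bit vector, so the result is sparse and discrete.
    """
    tokens = tokenize(smiles)
    bits = [0] * n_bits
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            key = "".join(tokens[i:i + n]).encode("utf-8")
            bits[zlib.crc32(key) % n_bits] = 1
    return bits


if __name__ == "__main__":
    print(tokenize("CCCl"))                 # chloroethane: ['C', 'C', 'Cl']
    print(hashed_ngram_fingerprint("CCO"))  # ethanol as a 64-bit vector
```

A fixed hashing rule like this has no trainable parameters, so it cannot adapt to a corpus. The learned fingerprint described in the abstract instead comes from a sequence-to-sequence model pre-trained on a large set of SMILES strings.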

Code Implementations: 1 repo
