LG, Mar 21, 2017

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

arXiv:1703.07076v2, 340 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses data scarcity in molecular QSAR modeling, but the contribution is incremental: it applies an existing augmentation technique to a specific domain.

The paper tackles the problem of limited data in neural network modeling of molecules by using SMILES enumeration as data augmentation. On the test set, the correlation coefficient R² rose from 0.56 to 0.66 and the root mean square error fell from 0.62 to 0.55.
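The two metrics quoted above can be computed as in the following minimal sketch. It assumes that "correlation coefficient R²" means the squared Pearson correlation between measured and predicted values; the summary does not state the exact definition, so this is an assumption. The function names and the sample values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation (assumed meaning of R² here)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Made-up values: predictions are perfectly correlated but offset by 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
print(rmse(y_true, y_pred), squared_pearson(y_true, y_pred))
```

With these made-up values the correlation is perfect while the error is not zero, which is why the two numbers are reported side by side: one tracks ranking and trend, the other tracks absolute deviation.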

Simplified Molecular Input Line Entry System (SMILES) is a single-line text representation of a molecule. One molecule can, however, have multiple SMILES strings, which is why canonical SMILES have been defined: they ensure a one-to-one correspondence between SMILES string and molecule. Here, the fact that multiple SMILES strings represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short-term memory (LSTM) cell-based neural network. The augmented dataset was 130 times bigger than the original. The network trained on the augmented dataset performs better on a test set than a model built with only one canonical SMILES string per molecule. The correlation coefficient R² on the test set improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS) likewise fell from 0.62 to 0.55. The technique also works in the prediction phase: averaging the predictions for the enumerated SMILES of each molecule gave a further improvement to a correlation coefficient of 0.68 and an RMS of 0.52.
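The prediction-phase averaging described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical stand-in for a trained LSTM, `average_per_molecule` is a hypothetical helper name, and the equivalent SMILES strings are written by hand for two small molecules rather than produced by a cheminformatics toolkit.

```python
from collections import defaultdict
from statistics import mean

def average_per_molecule(mol_ids, predictions):
    """Group predictions by molecule and return the mean for each one."""
    grouped = defaultdict(list)
    for mol_id, pred in zip(mol_ids, predictions):
        grouped[mol_id].append(pred)
    return {mol_id: mean(values) for mol_id, values in grouped.items()}

# Hand-written, chemically equivalent SMILES strings (illustrative only).
enumerated = {
    "ethanol": ["CCO", "OCC", "C(O)C"],
    "toluene": ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"],
}

def toy_model(smiles):
    # Hypothetical stand-in for a trained network. Like a real sequence
    # model, it may give different outputs for different SMILES strings
    # of the same molecule.
    return 0.1 * len(smiles) + smiles.count("O")

# Flatten to one row per SMILES string, keeping track of the molecule.
mol_ids = [name for name, forms in enumerated.items() for _ in forms]
preds = [toy_model(s) for forms in enumerated.values() for s in forms]

averaged = average_per_molecule(mol_ids, preds)
print(averaged)
```

Averaging treats each enumerated string as one vote for the molecule's property, so variation caused purely by the choice of string tends to cancel out.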

Code Implementations: 4 repos