CHEM-PH · LG · Feb 7, 2023

Recent advances in the Self-Referencing Embedding Strings (SELFIES) library

arXiv:2302.03620v1 · 21 citations · h-index: 109 · Has Code
Originality: Incremental advance
AI Analysis

This work solves the problem of unreliable molecular string generation for cheminformatics and deep learning applications, though it is incremental as it builds on prior SELFIES developments.

The paper addresses the syntactic and semantic errors that traditional string-based molecular representations such as SMILES are prone to. It builds on SELFIES, a representation that is inherently 100% robust, and presents an updated library that supports a wider range of molecules and is more efficient.

String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencIng Embedded Strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of the selfies library, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of the selfies library (version 2.1.1) in this manuscript.
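To make "inherently robust" concrete, the sketch below shows one way a decoder can guarantee that every token sequence maps to a chemically valid chain. It is a toy written for this summary, not the selfies library's implementation or grammar: the function name `toy_decode`, the four-element valence table, and the choice to skip unknown tokens are all assumptions, and branches and rings are left out entirely. The idea it illustrates is valence tracking: each requested bond order is clamped to what the previous and the new atom can still accept, and decoding stops once no free valence remains.

```python
# Toy valence table (assumed for illustration; the real library supports far more).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def toy_decode(tokens):
    """Decode tokens such as "[C]" or "[=O]" into a linear SMILES chain.

    Any token sequence yields a chain that respects the valence table,
    because bond orders are clamped instead of rejected.
    """
    smiles = []
    remaining = None  # free valence of the previous atom; None before the first atom
    for token in tokens:
        body = token.strip("[]")
        if not body:
            continue  # toy choice: ignore empty tokens
        bond = body[0] if body[0] in "=#" else ""
        element = body[len(bond):]
        if element not in MAX_VALENCE:
            continue  # toy choice: ignore unknown tokens
        if remaining is None:
            # First atom: there is no previous atom, so any bond prefix is dropped.
            smiles.append(element)
            remaining = MAX_VALENCE[element]
            continue
        if remaining == 0:
            break  # previous atom is saturated, so the chain ends here
        # Clamp the requested bond order to what both atoms can accept.
        order = min(BOND_ORDER[bond], remaining, MAX_VALENCE[element])
        smiles.append(BOND_SYMBOL[order] + element)
        remaining = MAX_VALENCE[element] - order
    return "".join(smiles)


print(toy_decode(["[C]", "[=O]"]))         # C=O
print(toy_decode(["[F]", "[=C]", "[C]"]))  # FCC (double bond to F is clamped to single)
print(toy_decode(["[C]", "[F]", "[C]"]))   # CF  (F is saturated, so decoding stops)
```

A SMILES generator, by contrast, can emit a string such as `F=C` that parses but violates valence rules, which is the kind of semantic error the abstract refers to.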

Code Implementations: 1 repo