PL · LG · Jan 23, 2025

Representation of Molecules via Algebraic Data Types: Advancing Beyond SMILES & SELFIES

arXiv:2501.13633v3 · Has Code
AI Analysis

Algebraic Data Types give molecules a more robust and flexible digital representation than string-based formats such as SMILES and SELFIES, which benefits researchers in computational chemistry and drug discovery.

The paper addresses molecular representation for computational tasks by introducing a representation based on Algebraic Data Types (ADTs). Unlike string-based methods such as SMILES and SELFIES, it enables meaningful inference and supports 3D, stereochemical, and quantum information. The framework's capabilities are demonstrated through applications in Bayesian probabilistic programming, geometric learning, and chemical reaction modeling.

We introduce a novel molecular representation through Algebraic Data Types (ADTs) - composite data structures formed through the combination of simpler types that obey algebraic laws. By explicitly considering how the datatype of a representation constrains the operations which may be performed, we ensure meaningful inference can be performed over generative models (programs with sample and score operations). This stands in contrast to string-based representations where string-type operations may only indirectly correspond to chemical and physical molecular properties, and at worst produce nonsensical output. The ADT presented implements the Dietz representation for molecular constitution via multigraphs and bonding systems, and uses atomic coordinate data to represent 3D information and stereochemical features. This creates a general digital molecular representation which surpasses the limitations of the string-based representations and the 2D-graph based models on which they are based. In addition, we present novel support for quantum information through representation of shells, subshells, and orbitals, greatly expanding the representational scope beyond current approaches, for instance in Molecular Orbital theory. The framework's capabilities are demonstrated through key applications: Bayesian probabilistic programming is demonstrated through integration with LazyPPL, a lazy probabilistic programming library; molecules are made instances of a group under rotation, necessary for geometric learning techniques which exploit the invariance of molecular properties under different representations; and the framework's flexibility is demonstrated through an extension to model chemical reactions. After critiquing previous representations, we provide an open-source solution in Haskell - a type-safe, purely functional programming language.
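To make the idea of a typed molecular representation more concrete, the following is a minimal sketch in Python, chosen only for illustration; the paper's own implementation is in Haskell and is not reproduced here. All names (Element, Atom, Bond, Molecule, rotate_z, bond_length) are hypothetical. The Bond type with an integer order is a deliberate simplification and is not the Dietz representation of bonding systems described in the abstract. Only rotations about the z-axis are modelled, which is enough to show rotations composing as a group and a molecular property (a bond length) staying invariant under them.

```python
from dataclasses import dataclass
from enum import Enum
from math import cos, sin, pi, dist
from typing import Tuple


class Element(Enum):
    """Sum type: a value is exactly one of the listed alternatives."""
    H = 1
    C = 6
    O = 8


@dataclass(frozen=True)
class Atom:
    """Product type: an element paired with 3D Cartesian coordinates."""
    element: Element
    position: Tuple[float, float, float]


@dataclass(frozen=True)
class Bond:
    """Simplified bond between two atom indices (assumption: NOT the Dietz model)."""
    i: int
    j: int
    order: int


@dataclass(frozen=True)
class Molecule:
    """A molecule as immutable tuples of atoms and bonds."""
    atoms: Tuple[Atom, ...]
    bonds: Tuple[Bond, ...]


def rotate_z(mol: Molecule, theta: float) -> Molecule:
    """Return a new molecule with every atom rotated about the z-axis by theta radians."""
    c, s = cos(theta), sin(theta)
    atoms = tuple(
        Atom(a.element,
             (c * a.position[0] - s * a.position[1],
              s * a.position[0] + c * a.position[1],
              a.position[2]))
        for a in mol.atoms
    )
    # Connectivity does not depend on orientation, so bonds are reused unchanged.
    return Molecule(atoms, mol.bonds)


def bond_length(mol: Molecule, bond: Bond) -> float:
    """Euclidean distance between the two atoms joined by the bond."""
    return dist(mol.atoms[bond.i].position, mol.atoms[bond.j].position)


# Illustrative, approximate geometry for a water-like molecule (coordinates are made up).
water = Molecule(
    atoms=(
        Atom(Element.O, (0.0, 0.0, 0.0)),
        Atom(Element.H, (0.96, 0.0, 0.0)),
        Atom(Element.H, (-0.24, 0.93, 0.0)),
    ),
    bonds=(Bond(0, 1, 1), Bond(0, 2, 1)),
)

# Two successive rotations compose into a single rotation by the summed angle.
twice = rotate_z(rotate_z(water, pi / 6), pi / 3)
once = rotate_z(water, pi / 2)
```

Because the types only admit operations defined on atoms, bonds, and coordinates, an operation such as rotate_z cannot produce a malformed molecule, whereas an arbitrary edit to a SMILES string can. In this sketch, twice and once agree up to floating-point error, and bond_length returns the same value before and after rotation.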
