LGOct 4, 2023

SALSA: Semantically-Aware Latent Space Autoencoder

Kathryn E. Kirchoff, Travis Maxfield, Alexander Tropsha, Shawn M. Gomez

arXiv:2310.02744v16.63 citationsh-index: 4

Originality Incremental advance

AI Analysis

This addresses a specific issue in drug discovery for researchers using deep learning, though it appears incremental as it modifies existing methods with a contrastive objective.

The paper tackled the problem of autoencoders learning incoherent latent spaces for molecular data represented as SMILES sequences, by proposing SALSA, a transformer-autoencoder with a contrastive task, which resulted in a higher quality latent space that is more structurally-aware, semantically continuous, and property-aware.

In deep learning for drug discovery, chemical data are often represented as simplified molecular-input line-entry system (SMILES) sequences which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are defined by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not respect the structural similarities between molecules. To address this shortcoming we propose Semantically-Aware Latent Space Autoencoder (SALSA), a transformer-autoencoder modified with a contrastive task, tailored specifically to learn graph-to-graph similarity between molecules. Formally, the contrastive objective is to map structurally similar molecules (separated by a single graph edit) to nearby codes in the latent space. To accomplish this, we generate a novel dataset comprised of sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We compare SALSA to its ablated counterparts, and show empirically that the composed training objective (reconstruction and contrastive task) leads to a higher quality latent space that is more 1) structurally-aware, 2) semantically continuous, and 3) property-aware.

View on arXiv PDF

Similar