LG · May 16

Atoms as Language: VQ-Atom: Semantic Discretization for Molecular Representation Learning

arXiv:2605.168231.2
Predicted impact: top 99% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For molecular representation learning in drug discovery, VQ-Atom provides a semantically meaningful tokenization that enhances predictive performance over conventional SMILES-based approaches.

VQ-Atom introduces a semantic discretization method that converts continuous atom-level graph representations into discrete tokens representing local chemical environments. This improves protein-ligand interaction prediction under a protein-cold split without using 3D structural information.

Molecular representation learning has become a central approach in AI-driven drug discovery, yet existing molecular tokenizations such as SMILES remain largely syntactic and do not naturally align with chemically meaningful substructures. In this work, we introduce VQ-Atom, a semantic discretization framework that converts continuous atom-level graph representations into discrete tokens corresponding to local chemical environments. Using graph neural network embeddings and vector quantization, atoms are assigned to codebook entries representing chemically meaningful atomic contexts. These discrete tokens define a molecular language suitable for Transformer-based pretraining. We evaluate VQ-Atom in protein-ligand interaction prediction under a protein-cold split setting without relying on 3D structural information. Experimental results show that VQ-Atom consistently improves predictive performance compared to conventional tokenization approaches, suggesting that semantically grounded discretization can substantially enhance molecular representation learning. Our findings indicate that token design itself plays a critical role in enabling effective language modeling for chemistry.
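The quantization step described in the abstract can be illustrated with a minimal sketch: each atom embedding is mapped to the index of its nearest codebook vector, and the resulting indices form the token sequence for a molecule. This is not the authors' implementation. The function names, the squared Euclidean distance, the 2-dimensional toy vectors, and the three-entry codebook are all assumptions made for illustration; in the paper the embeddings come from a graph neural network and the codebook is learned.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_atoms(
    atom_embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each atom embedding to the index of its nearest codebook entry.

    Hypothetical helper: the distance metric and tie-breaking rule
    (lowest index wins) are assumptions, not taken from the paper.
    """
    tokens = []
    for embedding in atom_embeddings:
        distances = [squared_distance(embedding, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy data standing in for GNN outputs and a learned codebook (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
atom_embeddings = [[0.1, -0.1], [0.9, 1.2], [4.8, 5.1], [1.1, 0.9]]

tokens = quantize_atoms(atom_embeddings, codebook)
print(tokens)  # [0, 1, 2, 1]
```

In this toy case the second and fourth atoms receive the same token because their embeddings fall near the same codebook vector, which is the sense in which atoms in similar local environments would share a discrete symbol.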
