LG | AI | CHEM-PH | BM | Sep 19, 2024

Tokenization for Molecular Foundation Models

arXiv:2409.15370v3, 2 citations, h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in cheminformatics that affects researchers in pharmacology, agriculture, biology, and energy storage. The advance is incremental because it builds on existing tokenization methods.

The paper tackles the problem of closed-vocabulary tokenizers limiting molecular foundation models. It systematically evaluates 34 tokenizers and proposes two new ones (Smirk and Smirk-GPE) with full coverage of the OpenSMILES specification, so that models can represent a larger portion of molecular space.

Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in material science and molecular design. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom, facilitating applications in pharmacology, agriculture, biology, and energy storage. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.
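
To make the coverage gap concrete, the following is a minimal sketch of how a closed-vocabulary, atom-wise tokenizer can fail on a bracketed atom it has not seen, while a finer-grained split of the same atom reuses known pieces. It is an illustration only and not the Smirk implementation: the regular expressions are simplified and do not cover the full OpenSMILES grammar, and the names `atom_wise`, `glyph_wise`, `build_vocab`, and `unknown_tokens`, as well as the toy corpus, are hypothetical.

```python
import re

# Simplified atom-wise pattern: each bracketed atom becomes ONE token.
# (Assumption: a reduced pattern for illustration, not full OpenSMILES.)
ATOM_WISE = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()=#\-+\\/:.~*$]"
)
# Hypothetical finer split applied only inside bracketed atoms.
GLYPH = re.compile(r"@@|[A-Z][a-z]?|[a-z]|\d|[\[\]@+\-:*]")


def atom_wise(smiles):
    """Tokenize a SMILES string, keeping bracketed atoms whole."""
    return ATOM_WISE.findall(smiles)


def glyph_wise(smiles):
    """Tokenize a SMILES string, decomposing bracketed atoms into pieces."""
    tokens = []
    for tok in atom_wise(smiles):
        tokens.extend(GLYPH.findall(tok) if tok.startswith("[") else [tok])
    return tokens


def build_vocab(corpus, tokenize):
    """Closed vocabulary: every token observed in the corpus."""
    return {tok for smiles in corpus for tok in tokenize(smiles)}


def unknown_tokens(smiles, vocab, tokenize):
    """Tokens of `smiles` that fall outside the vocabulary."""
    return [tok for tok in tokenize(smiles) if tok not in vocab]


# Toy corpus and an unseen molecule (tetramethylammonium cation).
corpus = ["CCO", "c1ccccc1", "C[C@H](N)C(=O)O", "[NH4+]"]
unseen = "C[N+](C)(C)C"

print(unknown_tokens(unseen, build_vocab(corpus, atom_wise), atom_wise))
# ['[N+]']
print(unknown_tokens(unseen, build_vocab(corpus, glyph_wise), glyph_wise))
# []
```

In this toy setting the atom-wise vocabulary has seen `[NH4+]` but not `[N+]`, so the latter would map to an unknown token. The decomposed form recombines `[`, `N`, `+`, and `]`, all of which already appear in the corpus.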
