LG AI CHEM-PH BM

Sep 19, 2024

Tokenization for Molecular Foundation Models

Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan

arXiv:2409.15370v34.62 citations

h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in cheminformatics for researchers in pharmacology, agriculture, biology, and energy storage, though it is incremental as it builds on existing tokenization methods.

The paper addresses the problem that closed-vocabulary tokenizers limit molecular foundation models. It systematically evaluates 34 tokenizers and proposes two new ones (Smirk and Smirk-GPE) with full coverage of the OpenSMILES specification, so that models are no longer restricted to the fraction of molecular space their vocabulary happens to cover.
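To make the closed-vocabulary limitation concrete, here is a minimal sketch of how a coverage gap can be measured. It is not any of the evaluated tokenizers: the regular expression, the tiny training set, and the helper names `atomwise_tokens` and `unknown_fraction` are all assumptions chosen for illustration. The idea is that an atom-wise tokenizer keeps each bracket atom as one token, so any bracket atom absent from the training data falls outside the vocabulary.

```python
import re

# Simplified atom-wise pattern (an assumption, not a published tokenizer):
# bracket atoms stay whole, then two-letter halogens, two-digit ring
# closures, and finally any single character.
ATOMWISE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atomwise_tokens(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOMWISE.findall(smiles)


def unknown_fraction(smiles, vocab):
    """Return the share of tokens missing from vocab, and those tokens."""
    tokens = atomwise_tokens(smiles)
    unknown = [t for t in tokens if t not in vocab]
    return len(unknown) / len(tokens), unknown


# Hypothetical closed vocabulary built from a very small training set.
train = ["CCO", "c1ccccc1", "CC(=O)O", "C[NH3+]"]
vocab = {t for s in train for t in atomwise_tokens(s)}

for smi in ["CCO", "C[13CH3]", "[Fe+2]"]:
    frac, missing = unknown_fraction(smi, vocab)
    print(smi, frac, missing)
```

In this toy setup the isotope-labelled carbon and the iron cation are valid SMILES atoms, yet each maps to an unknown token because the vocabulary was fixed at training time.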

Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in material science and molecular design. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers, Smirk and Smirk-GPE, with full coverage of the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom, facilitating applications in pharmacology, agriculture, biology, and energy storage. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.
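The n-gram proxy mentioned in the abstract can be sketched roughly as follows. The abstract does not specify the order of the n-gram models, the smoothing scheme, or the metric, so everything here is an assumption: a bigram model with add-one smoothing, scored by per-token cross-entropy in bits, with a character-level stand-in tokenizer (`char_tokens`) in place of a real one. Swapping in a different `tokenize` function is how one would compare tokenizers under this kind of proxy.

```python
import math
from collections import Counter


def char_tokens(smiles):
    # Stand-in tokenizer (assumption): one token per character.
    return list(smiles)


def train_bigram(corpus, tokenize):
    """Count contexts and bigrams over a corpus of SMILES strings."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in corpus:
        toks = ["<s>"] + tokenize(s) + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            contexts[a] += 1          # how often a appears as a context
            bigrams[(a, b)] += 1
    return contexts, bigrams, vocab


def cross_entropy(smiles, model, tokenize):
    """Per-token cross-entropy (bits) under add-one smoothing."""
    contexts, bigrams, vocab = model
    toks = ["<s>"] + tokenize(smiles) + ["</s>"]
    v = len(vocab) + 1                # +1 reserves mass for unseen tokens
    nll = 0.0
    for a, b in zip(toks, toks[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + v)
        nll -= math.log2(p)
    return nll / (len(toks) - 1)


# Hypothetical toy corpus; a real comparison would use a large SMILES set.
model = train_bigram(["CCO", "CCN", "CCC"], char_tokens)
print(cross_entropy("CCO", model, char_tokens))
print(cross_entropy("[Fe]", model, char_tokens))
```

Because such a model trains by counting, it costs far less than pretraining a transformer encoder, which is what makes it attractive as a first screen. Note that per-token scores are not directly comparable across tokenizers that produce different sequence lengths, so a real comparison would need a shared normalization.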

