LG · AI · Oct 24, 2024

Bio2Token: All-atom tokenization of any biomolecular structure with Mamba

arXiv:2410.19110v3 · 9 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the need for high-fidelity, all-atom representations in biomolecular design. It is an incremental advance: it builds on existing tokenization methods but extends them to larger systems and multiple biomolecule types.

The paper tackles the problem of efficiently encoding large 3D biomolecular structures at the atom level. It reports reconstruction accuracies below 1 Angstrom and scales to systems with nearly 100,000 atoms using a Mamba state space model.

Efficient encoding and representation of large 3D molecular structures with high fidelity is critical for biomolecular design applications. Despite this, many representation learning approaches restrict themselves to modeling smaller systems or use coarse-grained approximations of the systems, for example modeling proteins at the resolution of amino acid residues rather than at the level of individual atoms. To address this, we develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies well below 1 Angstrom. We demonstrate that a simple Mamba state space model architecture is efficient compared to an SE(3)-invariant IPA architecture, reaches competitive accuracies and can scale to systems with almost 100,000 atoms. The learned structure tokens of bio2token may serve as the input for all-atom generative models in the future.
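
To make the two central ideas of the abstract concrete, here is a minimal sketch of (1) turning continuous per-atom latent vectors into discrete tokens and (2) measuring reconstruction error in Angstrom. It is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the quantizer, so a plain nearest-codebook lookup stands in for it. The function names `quantize` and `rmsd`, the toy codebook, and all numbers are hypothetical. The RMSD here also assumes both structures are already in the same coordinate frame and share atom ordering, so no superposition is performed.

```python
import math


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector.

    Assumption: nearest-neighbour lookup under squared Euclidean distance.
    The quantizer actually used in the paper is not described in this summary.
    """
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two atom lists with identical ordering.

    Assumption: both structures are already in the same frame (no alignment).
    The result is in the same unit as the input coordinates, e.g. Angstrom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        sum((p - q) ** 2 for p, q in zip(a, b))
        for a, b in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy 2-D latents and a 3-entry codebook (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1)]
tokens = quantize(latents, codebook)

# Toy 3-D coordinates in Angstrom: original vs. reconstructed (hypothetical values).
original = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
reconstructed = [(0.3, 0.0, 0.0), (1.5, 0.4, 0.0)]
error = rmsd(original, reconstructed)

print(tokens, round(error, 4))
```

In this toy setting each atom becomes one integer token, and an RMSD below 1.0 would correspond to the "below 1 Angstrom" regime the abstract reports.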

