LGBMFeb 6

Adaptive Protein Tokenization

arXiv:2602.06418v12 citationsh-index: 5
Originality Incremental advance
AI Analysis

This work addresses limitations in protein structure tokenization for researchers in computational biology and AI, offering incremental improvements over local methods.

The paper tackles the problem of tokenizing protein structures for multi-modal models by introducing a global tokenization method that mitigates error accumulation and enhances generative and representation tasks, demonstrating performance matching or outperforming existing local tokenizers on tasks like reconstruction and CATH classification.

Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes