LG | CL | DC | QM | Jan 16, 2023

Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling

arXiv:2301.06568v1 | 162 citations | h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses the need for more accessible and efficient protein modeling for researchers, though it appears incremental as it optimizes existing methods rather than introducing a new paradigm.

The authors improve protein language model performance without scaling up model size, reporting state-of-the-art results on structure and function benchmarks with far fewer parameters (e.g., <10% for pre-training).

As opposed to scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life optimally. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4, surpassing state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.

Code Implementations: 1 repo
