LGAIBM Feb 27, 2024

TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation

arXiv:2402.17156v1 | 7 citations | h-index: 12 | Has Code | Sci China Inf Sci
Originality: Incremental advance
AI Analysis

This work addresses the need for controllable protein design in biology and chemistry, offering a method for generating taxonomic-specific proteins, though it is incremental as it builds on existing diffusion models.

The authors address controllable protein sequence generation with TaxDiff, a taxonomic-guided diffusion model that integrates biological species information to generate structurally stable proteins. According to the abstract, TaxDiff performs better on multiple protein sequence generation benchmarks in both taxonomic-guided and unconditional settings, its sequences surpass those of direct-structure-generation models in predicted-structure confidence, and it needs only a quarter of the time required by diffusion-based models.

Designing protein sequences with specific biological functions and structural stability is crucial in biology and chemistry. Generative models have already demonstrated their capabilities for reliable protein design. However, previous models are limited to the unconditional generation of protein sequences and lack the controllable generation ability that is vital to biological tasks. In this work, we propose TaxDiff, a taxonomic-guided diffusion model for controllable protein sequence generation that combines biological species information with the generative capabilities of diffusion models to generate structurally stable proteins within the sequence space. Specifically, taxonomic control information is inserted into each layer of the transformer block to achieve fine-grained control. The combination of global and local attention ensures the sequence consistency and structural foldability of taxonomic-specific proteins. Extensive experiments demonstrate that TaxDiff consistently achieves better performance on multiple protein sequence generation benchmarks in both taxonomic-guided controllable generation and unconditional generation. Remarkably, the sequences generated by TaxDiff even surpass those produced by direct-structure-generation models in terms of confidence based on predicted structures, and require only a quarter of the time needed by diffusion-based models. The code for generating proteins and training new versions of TaxDiff is available at: https://github.com/Linzy19/TaxDiff.
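
The two architectural ideas in the abstract, injecting taxonomic information at every transformer layer and combining global with local attention, can be illustrated with a small standard-library Python sketch. This is not the TaxDiff implementation: the function names (`local_mask`, `global_mask`, `attention`, `block`, `forward`), the additive injection of the taxonomy embedding, the window size, and the equal-weight averaging of the two attention outputs are all assumptions made for illustration, and the attention here has no learned projections.

```python
import math
import random


def global_mask(length):
    """Every position may attend to every other position."""
    return [[True] * length for _ in range(length)]


def local_mask(length, window):
    """Position i may attend to j only when |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(length)] for i in range(length)]


def attention(x, mask):
    """Masked single-head self-attention with queries = keys = values = x.

    A real model would use learned projections; they are omitted here.
    """
    dim = len(x[0])
    out = []
    for i, query in enumerate(x):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, key in enumerate(x)
        ]
        top = max(scores)  # finite: a position can always attend to itself
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * value[c] for w, value in zip(weights, x)) / total
            for c in range(dim)
        ])
    return out


def block(x, tax_emb, window):
    """One hypothetical layer: inject taxonomy, then mix global and local attention."""
    # Assumption: the taxonomy embedding is simply added to every position.
    h = [[v + t for v, t in zip(row, tax_emb)] for row in x]
    g = attention(h, global_mask(len(h)))
    l = attention(h, local_mask(len(h), window))
    # Assumption: residual connection plus an equal-weight average of both views.
    return [
        [hv + 0.5 * (gv + lv) for hv, gv, lv in zip(hr, gr, lr)]
        for hr, gr, lr in zip(h, g, l)
    ]


def forward(x, tax_emb, num_layers=3, window=2):
    """Apply the same taxonomy embedding inside every layer."""
    for _ in range(num_layers):
        x = block(x, tax_emb, window)
    return x


# Toy usage: 6 residues with 4-dimensional features and two made-up taxa.
random.seed(0)
residues = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
out_taxon_a = forward(residues, tax_emb=[0.1, 0.1, 0.1, 0.1])
out_taxon_b = forward(residues, tax_emb=[-0.1, -0.1, -0.1, -0.1])
```

Because the embedding is re-applied inside each layer rather than only at the input, the same residue features produce different hidden states for different taxa at every depth, which is one plausible reading of the "fine-grained control" the abstract describes.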

Code Implementations: 3 repos
