BM, LG, QM | Apr 14, 2022

Generative power of a protein language model trained on multiple sequence alignments

arXiv:2204.07110v2 | 43 citations | h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses protein design, a problem of interest to biologists and computational researchers. It offers a sequence generation method that improves on existing approaches such as Potts models, though its advance over prior sequence generation techniques is incremental.

The researchers propose an iterative method for generating novel protein sequences with MSA Transformer, a protein language model trained on multiple sequence alignments. The resulting sequences score as well as natural sequences on homology, coevolution and structure-based measures. For large protein families, the synthetic sequences have properties similar to or better than those of sequences generated by Potts models; for small protein families, the method outperforms Potts models.

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally-validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
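The iterative masked-language-modeling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_predict` is a hypothetical stand-in that simply returns the most common residue in a column, where a real setup would query MSA Transformer for its prediction at each masked position. The mask symbol, the masking probability `p_mask`, the iteration count `n_iter`, and the greedy (most likely token) fill-in are all assumptions chosen for illustration.

```python
import random
from collections import Counter

MASK = "#"  # hypothetical mask symbol, chosen because it is not a residue letter


def toy_predict(masked_msa, row, col):
    """Toy stand-in for a masked language model (NOT MSA Transformer).

    Returns the most frequent unmasked residue in the column, or None
    if every entry in that column is currently masked. The `row`
    argument is unused here; a real model would condition on it.
    """
    counts = Counter(seq[col] for seq in masked_msa if seq[col] != MASK)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def iterative_masked_generation(msa, predict, p_mask=0.1, n_iter=20, seed=0):
    """Repeatedly mask random positions of an alignment and refill them.

    msa     : list of equal-length aligned sequences (strings)
    predict : callable(masked_msa, row, col) -> residue or None
    p_mask  : per-position masking probability (assumed value)
    n_iter  : number of mask-and-refill rounds (assumed value)
    """
    rng = random.Random(seed)
    current = [list(seq) for seq in msa]
    for _ in range(n_iter):
        # Pick the positions to mask in this round.
        positions = [
            (r, c)
            for r, seq in enumerate(current)
            for c in range(len(seq))
            if rng.random() < p_mask
        ]
        # Build the masked copy that the predictor sees.
        masked = [seq[:] for seq in current]
        for r, c in positions:
            masked[r][c] = MASK
        # Greedily refill each masked position; keep the old residue
        # when the predictor has nothing to offer.
        for r, c in positions:
            guess = predict(masked, r, c)
            if guess is not None:
                current[r][c] = guess
    return ["".join(seq) for seq in current]


example_msa = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
print(iterative_masked_generation(example_msa, toy_predict, p_mask=0.3, n_iter=10))
```

Passing the predictor as a callable keeps the loop independent of the model, so the toy column-frequency predictor could in principle be swapped for a wrapper around a real masked language model without changing the iteration logic.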

Code Implementations: 1 repo
