BM · LG · QM
Mar 29, 2022

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo, Damiano Sgarbossa, Anne-Florence Bitbol

arXiv:2203.15465v2 · 60 citations · h-index: 22

Originality: Incremental advance

AI Analysis

This work helps computational biologists interpret what protein language models learn, offering insight into how these models encode phylogeny and how resilient they are to phylogenetic noise. It is rated incremental because it builds on existing models such as MSA Transformer.

The study asks what protein language models learn from multiple sequence alignments (MSAs). It shows that simple combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences, so the model encodes detailed phylogenetic relationships. It also shows that the model can separate coevolutionary signals from phylogenetic noise: unsupervised contact prediction is more resilient to such noise with MSA Transformer than with inferred Potts models.

Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
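For readers unfamiliar with the quantity that the column attentions are compared against, the following is a minimal sketch of a pairwise Hamming distance matrix for an MSA. The function name `hamming_distance_matrix` and the toy alignment are hypothetical, and two choices here are assumptions rather than details taken from the paper: distances are normalized by alignment length, and the gap character `-` is treated as an ordinary symbol.

```python
import numpy as np


def hamming_distance_matrix(msa):
    """Return normalized pairwise Hamming distances between aligned sequences.

    Assumption: gaps ('-') count as ordinary symbols, and each distance is
    the fraction of alignment columns at which two sequences differ.
    """
    lengths = {len(seq) for seq in msa}
    if len(lengths) != 1:
        raise ValueError("all sequences in an MSA must have the same length")

    # Encode the alignment as a (depth, length) array of single characters.
    arr = np.array([list(seq) for seq in msa])

    # Broadcast to compare every pair of rows column by column,
    # giving a (depth, depth, length) boolean array of mismatches.
    mismatches = arr[:, None, :] != arr[None, :, :]

    # Average over columns to get the fraction of differing positions.
    return mismatches.mean(axis=2)


# Hypothetical toy alignment with three sequences of length 5.
toy_msa = ["MKV-A", "MKVLA", "MRV-G"]
D = hamming_distance_matrix(toy_msa)
```

The result is a symmetric matrix with a zero diagonal, with one row and one column per sequence in the alignment. A matrix of this shape is what one would correlate against a sequence-by-sequence matrix derived from column attentions.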
