BM | AI | LG | Mar 15, 2024

On Recovering Higher-order Interactions from Protein Language Models

arXiv:2405.06645v1 | 9 citations | h-index: 10 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the interpretability and computational cost of protein language models and is aimed at researchers in computational biology and bioinformatics. It is an incremental advance, since it builds on existing sparse Fourier transform methods.

The paper tackles the problem of extracting higher-order mutational interactions from protein language models, which is computationally expensive because the sequence space grows exponentially with the number of sites. It develops a Fourier analysis framework for ESM2 and applies it to three proteins (GFP, TP53, and GB1). Validations on two sample proteins recover all interactions with R²=0.72 in the more sparse region and R²=0.66 in the more dense region, using only 7 million ESM2 samples and reducing computational time by a factor of 15,000.
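To make the scale concrete, the sketch below computes the size of the full sequence space for $n$ sites over the 20 standard amino acids, and the fraction of that space covered by 7 million samples in the 10-site case quoted in the abstract. The function names are illustrative and are not taken from the paper's repository. The sketch does not attempt to reproduce the reported 15,000-fold time reduction, which depends on details not given here.

```python
# Illustrative arithmetic only; names are hypothetical, not from the paper's code.
ALPHABET_SIZE = 20  # number of standard amino acids


def sequence_space_size(n_sites: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct sequences when n_sites positions are varied freely."""
    return alphabet_size ** n_sites


def sampled_fraction(n_samples: int, n_sites: int) -> float:
    """Fraction of the full sequence space covered by n_samples queries."""
    return n_samples / sequence_space_size(n_sites)


# The space grows by a factor of 20 with every additional site.
for n in (2, 5, 10):
    print(f"n={n:2d}: {sequence_space_size(n):,} sequences")

# 7 million samples against the full 10-site space.
print(f"sampled fraction at n=10: {sampled_fraction(7_000_000, 10):.2e}")
```

At $n=10$ the full space holds roughly $10^{13}$ sequences, so 7 million queries touch well under one millionth of it.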

Protein language models leverage evolutionary information to perform state-of-the-art 3D structure and zero-shot variant prediction. Yet, extracting and explaining all the mutational interactions that govern model predictions remains difficult as it requires querying the entire amino acid space for $n$ sites using $20^n$ sequences, which is computationally expensive even for moderate values of $n$ (e.g., $n\sim10$). Although approaches to lower the sample complexity exist, they often limit the interpretability of the model to just single and pairwise interactions. Recently, computationally scalable algorithms relying on the assumption of sparsity in the Fourier domain have emerged to learn interactions from experimental data. However, extracting interactions from language models poses unique challenges: it is unclear if sparsity is always present or if it is the only metric needed to assess the utility of Fourier algorithms. Herein, we develop a framework for a systematic Fourier analysis of the protein language model ESM2 applied to three proteins (green fluorescent protein (GFP), tumor protein P53 (TP53), and G domain B1 (GB1)) across various sites for 228 experiments. We demonstrate that ESM2 is dominated by three regions in the sparsity-ruggedness plane, two of which are better suited for sparse Fourier transforms. Validations on two sample proteins demonstrate recovery of all interactions with $R^2=0.72$ in the more sparse region and $R^2=0.66$ in the more dense region, using only 7 million out of $20^{10}\sim10^{13}$ ESM2 samples, reducing the computational time by a staggering factor of 15,000. All code and data are available on our GitHub repository: https://github.com/amirgroup-codes/InteractionRecovery.
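The notions of Fourier sparsity, interaction order, and $R^2$ recovery used in the abstract can be illustrated on a toy landscape. The sketch below is a minimal example under stated assumptions: it uses a dense, brute-force Fourier transform over $\mathbb{Z}_q^n$ with $q=4$ and $n=3$ (not $q=20$), a hand-made function `f` in place of ESM2 scores, and hypothetical helper names. It is not the sparse Fourier algorithm used in the paper, which avoids enumerating the full space.

```python
import numpy as np


def qary_fourier(f: np.ndarray) -> np.ndarray:
    """Normalized Fourier transform of f over Z_q^n, where f has shape (q,)*n."""
    return np.fft.fftn(f) / f.size


def sparse_reconstruction(F: np.ndarray, n_keep: int) -> np.ndarray:
    """Rebuild the landscape from the n_keep largest-magnitude coefficients."""
    flat = F.ravel()
    keep = np.argsort(np.abs(flat))[-n_keep:]
    truncated = np.zeros_like(flat)
    truncated[keep] = flat[keep]
    # ifftn divides by the array size, so multiply back to undo our normalization.
    return np.real(np.fft.ifftn(truncated.reshape(F.shape)) * F.size)


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination between the true and rebuilt landscapes."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy landscape (an assumption for illustration): a constant, a single-site
# effect at site 0, and a pairwise interaction between sites 1 and 2.
q, n = 4, 3
x0, x1, x2 = np.indices((q,) * n)
f = 1.0 + 2.0 * (x0 == 1) + 1.5 * ((x1 == 2) & (x2 == 3))

F = qary_fourier(f)
nonzero = np.abs(F) > 1e-9

# The order of a coefficient is the number of sites with a nonzero frequency.
orders = np.count_nonzero(np.argwhere(nonzero), axis=1)

print("nonzero coefficients:", int(nonzero.sum()), "of", F.size)
print("highest interaction order present:", int(orders.max()))
print("R^2 keeping 4 coefficients:", r_squared(f, sparse_reconstruction(F, 4)))
print("R^2 keeping all nonzero:", r_squared(f, sparse_reconstruction(F, int(nonzero.sum()))))
```

Because the toy function contains no three-site term, every coefficient involving all three sites vanishes, and the landscape is recovered exactly from a subset of the $q^n$ coefficients. Keeping fewer coefficients lowers $R^2$, which is the trade-off the sparsity-ruggedness analysis is meant to characterize.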

Code Implementations: 1 repo
