LG · Mar 16

Training-Free Generation of Protein Sequences from Small Family Alignments via Stochastic Attention

arXiv:2603.14717 · 15.12 citations · h-index: 3
AI Analysis

This paper addresses protein sequence generation for families with limited data. It offers a fast, training-free method that avoids overfitting and produces structurally plausible sequences, though the contribution is incremental, since it builds on existing energy-based and attention concepts.

The paper tackles the problem of generating protein sequences from small family alignments, a regime where deep models overfit. It proposes stochastic attention (SA), a training-free sampler that treats the modern Hopfield energy as a Boltzmann distribution and samples from it via Langevin dynamics. The generated sequences retain 51 to 66 percent identity to natural family members while remaining novel, and they fold more faithfully to canonical family structures than natural members in six of eight families.

Most protein families have fewer than 100 known members, a regime where deep generative models overfit or collapse. We propose stochastic attention (SA), a training-free sampler that treats the modern Hopfield energy over a protein alignment as a Boltzmann distribution and draws samples via Langevin dynamics. The score function is a closed-form softmax attention operation requiring no training, no pretraining data, and no GPU, with cost linear in alignment size. Across eight Pfam families, SA generates sequences with low amino acid compositional divergence, substantial novelty, and structural plausibility confirmed by ESMFold and AlphaFold2. Generated sequences fold more faithfully to canonical family structures than natural members in six of eight families. Against profile HMMs, EvoDiff, and the MSA Transformer, which produce sequences that drift far outside the family, SA maintains 51 to 66 percent identity while remaining novel, in seconds on a laptop. The critical temperature governing generation is predicted from PCA dimensionality alone, enabling fully automatic operation. Controls confirm SA encodes correlated substitution patterns, not just per-position amino acid frequencies.
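
The sampler described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a flat one-hot encoding of the alignment (the abstract does not specify the encoding), the standard modern Hopfield energy E(x) = -(1/beta) logsumexp(beta X x) + 0.5 |x|^2, and a plain unadjusted Langevin update. The function names, the step size, the step count, and the toy alignment are all hypothetical, and beta is set by hand rather than predicted from PCA dimensionality as the paper describes.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap (assumed encoding)

def one_hot(alignment):
    """Encode equal-length aligned sequences as flat one-hot rows, shape (N, L*21)."""
    idx = {a: i for i, a in enumerate(ALPHABET)}
    length = len(alignment[0])
    X = np.zeros((len(alignment), length, len(ALPHABET)))
    for n, seq in enumerate(alignment):
        for p, aa in enumerate(seq):
            X[n, p, idx[aa]] = 1.0
    return X.reshape(len(alignment), -1)

def score(x, X, beta):
    """Negative energy gradient: softmax attention over stored rows, minus x."""
    logits = beta * (X @ x)            # one dot product per alignment row
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return X.T @ w - x

def sample(X, beta, step=0.05, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics targeting exp(-beta * E(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x = X[rng.integers(len(X))].copy()  # start from a random family member
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score(x, X, beta) + np.sqrt(2.0 * step / beta) * noise
    return x

def decode(x, length):
    """Map a continuous state back to a sequence by per-position argmax."""
    return "".join(ALPHABET[i] for i in x.reshape(length, -1).argmax(axis=1))

# Toy alignment, invented for illustration only (not a real Pfam family).
toy = ["ACDE", "ACDF", "GCDE"]
X_toy = one_hot(toy)
x_final = sample(X_toy, beta=5.0, rng=np.random.default_rng(0))
generated = decode(x_final, length=4)
```

Each step costs one pass over the alignment matrix, which is consistent with the linear cost in alignment size stated above. In this parameterization a larger beta (lower temperature) pulls samples toward stored family members, while a smaller beta lets them wander further from the alignment.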

