BMLG, May 21, 2025

Steering Generative Models with Experimental Data for Protein Fitness Optimization

arXiv:2505.15093v2, 7 citations, h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the challenge of optimizing protein sequences with limited experimental data. The advance is incremental but practical for researchers in computational biology.

The study tackles protein fitness optimization by evaluating strategies such as classifier guidance and posterior sampling for steering generative models with small amounts of labeled data. It shows that plug-and-play guidance offers advantages over alternatives such as reinforcement learning.

Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.
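To make the idea of guidance concrete, here is a minimal toy sketch. It is not the paper's implementation. It assumes a position-independent prior (uniform logits standing in for a pretrained discrete diffusion model), an additive per-site fitness predictor fit by ridge regression on a few hundred labeled sequences, and a bootstrap ensemble as a rough stand-in for sampling a predictor from a posterior, in the spirit of Thompson sampling. All function names (`fit_ridge`, `guided_probs`, `demo`), the simulated fitness landscape, and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np

N_STATES = 20  # amino-acid alphabet size


def one_hot(seqs):
    """Encode integer sequences (N, L) as flat one-hot features (N, L * 20)."""
    return np.eye(N_STATES)[seqs].reshape(len(seqs), -1)


def fit_ridge(seqs, fitness, alpha=1.0):
    """Fit an additive per-site fitness predictor; returns weights of shape (L, 20)."""
    X = one_hot(seqs)
    y = fitness - fitness.mean()  # centering absorbs the intercept
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    return w.reshape(seqs.shape[1], N_STATES)


def guided_probs(prior_logits, site_weights, guidance_scale):
    """Tilt the prior toward predicted fitness: p(x) * exp(scale * f(x)).

    For an additive predictor and a factorized prior, the tilted distribution
    factorizes too, so guidance reduces to adding weights to per-site logits.
    """
    logits = prior_logits + guidance_scale * site_weights
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def sample(probs, n, rng):
    """Draw n sequences, sampling each position independently."""
    return np.stack([rng.choice(N_STATES, size=n, p=p) for p in probs], axis=1)


def demo(seed=0, length=8, n_labeled=200, n_members=5, guidance_scale=2.0):
    rng = np.random.default_rng(seed)
    # Simulated ground truth (assumption): additive landscape plus assay noise.
    true_w = rng.normal(size=(length, N_STATES))

    def true_fitness(seqs):
        return true_w[np.arange(length), seqs].sum(axis=1)

    labeled = rng.integers(0, N_STATES, size=(n_labeled, length))
    measured = true_fitness(labeled) + 0.1 * rng.normal(size=n_labeled)

    # Bootstrap ensemble: each member sees a resampled copy of the labeled data.
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, n_labeled, size=n_labeled)
        members.append(fit_ridge(labeled[idx], measured[idx]))

    prior_logits = np.zeros((length, N_STATES))  # stand-in for a generative prior
    # Thompson-style step: pick one ensemble member at random, then guide with it.
    chosen = members[rng.integers(n_members)]

    unguided = sample(guided_probs(prior_logits, chosen, 0.0), 500, rng)
    guided = sample(guided_probs(prior_logits, chosen, guidance_scale), 500, rng)
    return true_fitness(unguided).mean(), true_fitness(guided).mean()


if __name__ == "__main__":
    base, steered = demo()
    print(f"mean true fitness, unguided: {base:.2f}, guided: {steered:.2f}")
```

In this toy setting the prior is left untouched and only the sampling distribution is reweighted, which is the sense in which guidance is plug-and-play. Real discrete diffusion models do not factorize by position, so guidance there is applied at each denoising step rather than in closed form.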

Code Implementations: 1 repo
