BM LG | Mar 19

Reinforcement-guided generative protein language models enable de novo design of highly diverse AAV capsids

Lucas Ferraz, Ana F. Rodrigues, Pedro Giesteira Cotovio, Mafalda Ventura, Gabriela Silva, Ana Sofia Coroadinha, Miguel Machuqueiro, Catia Pesquita

arXiv:2603.19473 | 18.4 | h-index: 25

AI Analysis

This work addresses the problem of navigating vast protein sequence spaces for AAV bioengineering, offering a method to generate diverse and functional capsids, though it is incremental in applying existing machine learning techniques to a specific domain.

The researchers tackled the challenge of designing novel adeno-associated viral (AAV) capsids for gene therapy by developing a generative framework combining protein language models and reinforcement learning, resulting in sequences with high predicted viability and increased novelty compared to fine-tuning alone. They proposed a candidate selection strategy integrating viability, novelty, and biophysical properties to prioritize variants for experimental evaluation.

Adeno-associated viral (AAV) vectors are widely used delivery platforms in gene therapy, and the design of improved capsids is key to expanding their therapeutic potential. A central challenge in AAV bioengineering, as in protein design more broadly, is the vast sequence design space relative to the scale of feasible experimental screening. Machine-guided generative approaches provide a powerful means of navigating this landscape and proposing novel protein sequences that satisfy functional constraints. Here, we develop a generative design framework based on protein language models and reinforcement learning to generate highly novel yet functionally plausible AAV capsids. A pretrained model was fine-tuned on experimentally validated capsid sequences to learn patterns associated with viability. Reinforcement learning was then used to guide sequence generation, with a reward function that jointly promoted predicted viability and sequence novelty, thereby enabling exploration beyond regions represented in the training data. Comparative analyses showed that fine-tuning alone produces sequences with high predicted viability but remains biased toward the training distribution, whereas reinforcement learning-guided generation reaches more distant regions of sequence space while maintaining high predicted viability. Finally, we propose a candidate selection strategy that integrates predicted viability, sequence novelty, and biophysical properties to prioritize variants for downstream evaluation. This work establishes a framework for the generative exploration of protein sequence space and advances the application of generative protein language models to AAV bioengineering.
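The abstract describes a reward that jointly promotes predicted viability and sequence novelty but does not give its form. The following is a minimal sketch of one way such a reward could be written, not the authors' implementation. Everything in it is an assumption made for illustration: the weighted-sum combination, the `novelty_weight` parameter, the use of normalized edit distance to the nearest training sequence as the novelty measure, and the `predict_viability` callable, which stands in for whatever viability predictor is available and is assumed to return a score in [0, 1].

```python
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def novelty(seq: str, references: Sequence[str]) -> float:
    """Assumed novelty measure: edit distance to the nearest reference,
    divided by the longer of the two lengths so the result lies in [0, 1]."""
    if not references:
        raise ValueError("at least one reference sequence is required")
    return min(
        levenshtein(seq, ref) / max(len(seq), len(ref), 1) for ref in references
    )


def reward(
    seq: str,
    references: Sequence[str],
    predict_viability: Callable[[str], float],  # hypothetical predictor, score in [0, 1]
    novelty_weight: float = 0.5,  # hypothetical trade-off parameter
) -> float:
    """Assumed weighted sum of predicted viability and novelty."""
    viability = predict_viability(seq)
    return (1.0 - novelty_weight) * viability + novelty_weight * novelty(seq, references)
```

With a weighted sum of this kind, setting `novelty_weight` to 0 reduces the reward to predicted viability alone, while larger values push generation further from the reference set. Other combinations (for example, a product or a viability threshold) would be equally consistent with the abstract.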
