InstructPro: Natural Language Guided Ligand-Binding Protein Design
This work addresses a fundamental challenge in biology and chemistry by enabling protein design beyond traditional data limitations, though it appears incremental because it builds on existing generative modeling approaches.
The authors tackle the problem of designing ligand-binding proteins when protein-ligand complex data is scarce. They introduce InstructPro, a generative model that designs proteins from natural language instructions and ligand formulas. The larger variant, InstructPro-3B, achieves design success rates of 5.06% for seen ligands and 3.93% in the zero-shot setting.
Designing ligand-binding proteins with precise functions is fundamental to advances in biology and chemistry, yet existing AI approaches are limited by scarce protein-ligand complex data. Meanwhile, abundant text descriptions of protein-ligand interactions remain underutilized. We introduce InstructPro, a family of generative models that design proteins from natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified functional descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants: InstructPro-1B and InstructPro-3B, which substantially outperform strong baselines. InstructPro-1B achieves design success rates of 2.46% (seen ligands) and 3.14% (zero-shot), while InstructPro-3B reaches 5.06% and 3.93%, respectively. These results demonstrate the potential of natural language-guided generative modeling to expand protein design capabilities beyond traditional data limitations.
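To make the data format and the reported metric concrete, here is a minimal sketch of a (function description, ligand, protein) triple and a design success rate expressed as a percentage. The field names, the `split_by_ligand` helper, and the reading of "seen" as "ligand formula appears in the training set" are assumptions made for illustration. The abstract does not state the success criterion, so it is passed in as a caller-supplied predicate rather than guessed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Triple:
    """One (function description, ligand, protein) record; field names are hypothetical."""
    function_description: str
    ligand_formula: str
    protein_sequence: str


def success_rate(designs: Iterable[Triple],
                 is_success: Callable[[Triple], bool]) -> float:
    """Percentage of designs accepted by a caller-supplied success predicate."""
    designs = list(designs)
    if not designs:
        return 0.0
    hits = sum(1 for d in designs if is_success(d))
    return 100.0 * hits / len(designs)


def split_by_ligand(designs: Iterable[Triple],
                    training_ligands: Set[str]) -> Tuple[List[Triple], List[Triple]]:
    """Partition designs into seen-ligand and zero-shot groups.

    Assumption: a ligand counts as "seen" if its formula occurs in training data.
    """
    seen, zero_shot = [], []
    for d in designs:
        (seen if d.ligand_formula in training_ligands else zero_shot).append(d)
    return seen, zero_shot
```

In this sketch, calling `success_rate` separately on the two groups returned by `split_by_ligand` would yield one percentage for seen ligands and one for the zero-shot setting, mirroring how the two figures are reported for each model.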