LG, CE, CL · Jun 11, 2025

InstructPro: Natural Language Guided Ligand-Binding Protein Design

CMU
arXiv:2506.09332v2 · 4 citations · h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses a fundamental challenge in biology and chemistry by enabling protein design beyond the limits of scarce protein-ligand complex data, though it appears incremental because it builds on existing generative modeling approaches.

The authors tackle the problem of designing ligand-binding proteins when protein-ligand complex data is scarce by introducing InstructPro, a family of generative models that design proteins from natural language instructions and ligand formulas. The larger variant, InstructPro-3B, achieves design success rates of 5.06% for seen ligands and 3.93% in the zero-shot setting.

Designing ligand-binding proteins with precise functions is fundamental to advances in biology and chemistry, yet existing AI approaches are limited by scarce protein-ligand complex data. Meanwhile, abundant text descriptions of protein-ligand interactions remain underutilized. We introduce InstructPro, a family of generative models that design proteins from natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified functional descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants: InstructPro-1B and InstructPro-3B, which substantially outperform strong baselines. InstructPro-1B achieves design success rates of 2.46% (seen ligands) and 3.14% (zero-shot), while InstructPro-3B reaches 5.06% and 3.93%, respectively. These results demonstrate the potential of natural language-guided generative modeling to expand protein design capabilities beyond traditional data limitations.
