ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation
For researchers in protein binder design, this work provides a much-needed standardized benchmark to enable fair and reproducible comparison of methods, addressing the lack of unified evaluation protocols in the field.
The paper introduces ProtDBench, a standardized evaluation framework for protein binder design, and demonstrates that evaluation design choices (e.g., verifier models, filtering rules, success criteria) significantly bias observed performance, with low agreement between different structure prediction verifiers. Benchmarking of generative methods across ten targets reveals trade-offs between computational efficiency, success rate, and structural diversity under a fixed 24-hour budget.
Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab-annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation, and reveal trade-offs between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.
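To make the three kinds of metric named in the abstract concrete, the following is a minimal illustrative sketch of a per-sequence success rate, a cluster-level success count, and a throughput-aware estimate under a fixed 24-hour budget. It is not ProtDBench's implementation: the `Design` record, its `passed` and `cluster_id` fields, and all function names are hypothetical, and the exact definitions used by the benchmark may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """One generated binder (hypothetical record, not ProtDBench's schema)."""
    passed: bool      # assumed: did the design pass the in silico filter?
    cluster_id: int   # assumed: label from some structural clustering step


def per_sequence_success_rate(designs):
    """Fraction of designs that pass the filter."""
    if not designs:
        return 0.0
    return sum(d.passed for d in designs) / len(designs)


def successful_clusters(designs):
    """Number of distinct structural clusters with at least one passing design."""
    return len({d.cluster_id for d in designs if d.passed})


def expected_successes_in_budget(success_rate, seconds_per_design, budget_hours=24.0):
    """Expected passing designs if generation runs for the whole budget.

    Assumes a constant cost per design; real pipelines also spend time on
    verification, which this sketch ignores.
    """
    n_designs = int(budget_hours * 3600 // seconds_per_design)
    return n_designs * success_rate


designs = [Design(True, 0), Design(True, 0), Design(False, 1), Design(True, 2)]
rate = per_sequence_success_rate(designs)
clusters = successful_clusters(designs)
in_budget = expected_successes_in_budget(0.5, seconds_per_design=3600)
```

In this toy input, three of four designs pass but two of them fall in the same cluster, so the cluster-level count is lower than the raw number of successes. Similarly, a slow method with a high per-sequence rate can yield fewer successes within the budget than a fast method with a lower rate, which is the kind of trade-off the abstract describes.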