Learning to design protein-protein interactions with enhanced generalization
This work addresses the generalization challenge in machine-learning-based design of protein-protein interactions for biomedical research and therapeutic development, contributing a new dataset, a new model, and non-leaking evaluation splits.
Machine learning models struggle to generalize when predicting the effects of mutations on protein-protein interactions (PPIs). The authors address this by constructing PPIRef, a large non-redundant dataset, and developing PPIformer, an SE(3)-equivariant model. PPIformer outperformed state-of-the-art methods on new data splits and in two case studies: optimizing a human antibody against SARS-CoV-2 and enhancing the thrombolytic activity of staphylokinase.
Discovering mutations that enhance protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond their training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model that generalizes across diverse protein-binder variants. We fine-tune PPIformer to predict the effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and on independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.
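The abstract does not spell out the form of the thermodynamically motivated adjustment. As a rough illustration only, one common way to connect a masked-residue model to binding energy changes is a log-odds score: mask the mutated sites, read off the predicted amino acid probabilities, and sum log p(wild type) minus log p(mutant) over the sites. The sketch below assumes that form. The function name `log_odds_ddg`, the toy logits, and the sign convention (negative means the mutant is preferred) are all hypothetical and are not taken from the abstract.

```python
import math

# The 20 standard amino acids in a fixed (assumed) column order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_odds_ddg(logits_per_site, mutations):
    """Hypothetical log-odds score for a set of point mutations.

    logits_per_site: dict mapping a site index to 20 logits that a
        masked-residue model (assumed, not shown) predicts for that
        site when it is masked.
    mutations: list of (site_index, wild_type_aa, mutant_aa) tuples.

    Returns the sum over sites of log p(wild type) - log p(mutant).
    Under the assumed convention, a negative value means the model
    finds the mutant residues more plausible than the wild type.
    """
    score = 0.0
    for site, wt, mut in mutations:
        probs = softmax(logits_per_site[site])
        p_wt = probs[AMINO_ACIDS.index(wt)]
        p_mut = probs[AMINO_ACIDS.index(mut)]
        score += math.log(p_wt) - math.log(p_mut)
    return score


# Toy input: one masked site where the model favors A over W.
toy_logits = {0: [0.0] * 20}
toy_logits[0][AMINO_ACIDS.index("A")] = 2.0
toy_logits[0][AMINO_ACIDS.index("W")] = 1.0

# Mutating A -> W is scored as unfavorable (positive score).
score = log_odds_ddg(toy_logits, [(0, "A", "W")])
# Reversing the mutation flips the sign.
reverse_score = log_odds_ddg(toy_logits, [(0, "W", "A")])
```

Because the softmax normalizer cancels in the difference of log-probabilities, the score for a single site equals the gap between the two logits. A score of this kind can be computed directly from a pre-trained masked-residue model and then refined against labeled mutational data during fine-tuning.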