Geometric Self-Supervised Pretraining on 3D Protein Structures using Subgraphs
This work addresses the gap in exploiting 3D protein structures for biological applications such as protein function prediction, offering a novel approach that eliminates the need for multiple views or augmentations.
The authors tackle the problem of learning protein representations from 3D structures by proposing a self-supervised pretraining method that predicts distances between local subgraph centroids and the global protein centroid, achieving performance improvements of up to 6% in protein classification tasks.
Protein representation learning aims to learn informative protein embeddings capable of addressing crucial biological questions, such as protein function prediction. Although sequence-based transformer models have shown promising results by leveraging the vast amount of protein sequence data in a self-supervised way, there is still a gap in exploiting the available 3D protein structures. In this work, we propose a pretraining scheme that goes beyond trivial masking methods by leveraging the 3D and hierarchical structure of proteins. Specifically, we propose a novel self-supervised method to pretrain 3D graph neural networks (GNNs) on 3D protein structures by predicting the distances between the local geometric centroids of protein subgraphs and the global geometric centroid of the protein. By considering subgraphs and their relationships to the global protein structure, our model can better learn the geometric properties of the protein structure. We experimentally show that our proposed pretraining strategy leads to significant improvements, of up to 6%, in the performance of 3D GNNs on various protein classification tasks. Our work opens new possibilities in unsupervised learning for protein graph models while eliminating the need for the multiple views, augmentations, or masking strategies that are currently used.
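To make the pretraining target concrete, the following is a minimal sketch of how the centroid-distance labels could be computed from raw 3D coordinates. It is an illustration under stated assumptions, not the implementation used in this work: the abstract does not specify how subgraphs are sampled, so the radius-based rule, the function names (`centroid`, `radius_subgraph`, `centroid_distance_targets`), and the toy coordinates are all hypothetical.

```python
import math

def centroid(points):
    """Arithmetic mean of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def radius_subgraph(coords, center_idx, radius):
    """Indices of all nodes within `radius` of the chosen center node.

    ASSUMPTION: a radius ball is only one possible way to define a
    subgraph; the source does not state the sampling rule.
    """
    center = coords[center_idx]
    return [i for i, p in enumerate(coords) if math.dist(center, p) <= radius]

def centroid_distance_targets(coords, center_indices, radius):
    """Distance from each subgraph's local centroid to the global centroid.

    These scalars would serve as regression targets for the 3D GNN
    during self-supervised pretraining.
    """
    global_c = centroid(coords)
    targets = []
    for idx in center_indices:
        nodes = radius_subgraph(coords, idx, radius)
        local_c = centroid([coords[i] for i in nodes])
        targets.append(math.dist(local_c, global_c))
    return targets

# Toy example: four hypothetical residue positions (e.g. C-alpha atoms).
toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
toy_targets = centroid_distance_targets(toy_coords, center_indices=[0, 3], radius=0.5)
print(toy_targets)
```

Because the targets are derived from the structure itself, this kind of objective needs no second view, augmentation, or masked input, which is the property the abstract highlights.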