BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning
This work targets protein function prediction, aiming for more accurate and more interpretable protein representations. Its contribution is incremental in that it augments existing encoders rather than replacing them.
The paper addresses protein representation learning by introducing BioBlobs, a differentiable module that dynamically partitions proteins into flexibly sized substructures and improves the performance of protein encoders such as GVP-GNN across various tasks.
Protein function is driven by coherent substructures that vary in size and topology, yet current protein representation learning (PRL) models distort these signals by relying on rigid substructures such as k-hop and fixed-radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly sized, non-overlapping substructures ("blobs"). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
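To make the pipeline in the abstract concrete, the following is a minimal NumPy sketch of the forward pass only: residues are grouped into non-overlapping blobs, each blob is snapped to its nearest entry in a shared codebook, and the quantized blobs are averaged into a protein embedding. It is not the authors' implementation. The function names (`pool_blobs`, `quantize`, `embed_protein`), the mean pooling, and the nearest-neighbour quantization are all assumptions made for illustration, and the blob assignment `blob_ids` is taken as a given input here, whereas BioBlobs learns the partition differentiably.

```python
import numpy as np

def pool_blobs(node_feats, blob_ids, num_blobs):
    """Mean-pool residue features into one vector per blob.

    Assumption: each residue belongs to exactly one blob (hard,
    non-overlapping assignment), supplied as integer ids.
    """
    dim = node_feats.shape[1]
    sums = np.zeros((num_blobs, dim))
    np.add.at(sums, blob_ids, node_feats)  # scatter-add residues into blobs
    counts = np.bincount(blob_ids, minlength=num_blobs).astype(float)
    return sums / np.maximum(counts, 1.0)[:, None]  # avoid division by zero

def quantize(blob_feats, codebook):
    """Map each blob vector to its nearest codebook entry.

    Assumption: squared Euclidean distance, as in standard vector
    quantization; the paper may use a different rule.
    """
    diffs = blob_feats[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)      # shape: (num_blobs, num_codes)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]

def embed_protein(node_feats, blob_ids, codebook):
    """Return discrete blob codes and a protein-level embedding."""
    num_blobs = int(blob_ids.max()) + 1
    blob_feats = pool_blobs(node_feats, blob_ids, num_blobs)
    codes, quantized = quantize(blob_feats, codebook)
    return codes, quantized.mean(axis=0)   # hypothetical readout: mean over blobs
```

Here `node_feats` would be the per-residue output of an encoder such as GVP-GNN, and `codes` is the sequence of discrete "vocabulary" indices that makes the representation inspectable. A trainable version would additionally need a gradient path through the partition and the `argmin`, which this sketch does not attempt.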