APObind: A Dataset of Ligand Unbound Protein Conformations for Machine Learning Applications in De Novo Drug Design
APObind addresses a data gap for researchers in computational drug design by providing a benchmark for validating methods on realistic unbound protein structures, though the contribution is incremental because it builds on existing datasets.
Machine learning methods for drug design are typically trained on ligand-bound protein conformations and may not perform well on native unbound conformations. To address this, the authors propose APObind, a dataset of unbound protein conformations derived from PDBbind, and demonstrate its importance through performance evaluations on three use cases.
Protein-ligand complex structures have been utilised to design benchmark machine learning methods that perform important tasks related to drug design, such as receptor binding site detection, small molecule docking and binding affinity prediction. However, these methods are usually trained only on ligand-bound (holo) conformations of the protein and are therefore not guaranteed to perform well when the protein structure is in its native unbound (apo) conformation, which is usually the conformation available for a newly identified receptor. A primary reason for this is that the local structure of the binding site usually changes upon ligand binding. To facilitate solutions to this problem, we propose a dataset called APObind that aims to provide apo conformations of proteins present in the PDBbind dataset, a popular dataset used in drug design. Furthermore, we explore the performance of methods specific to three use cases on this dataset, and the results demonstrate the importance of validating them on the APObind dataset.