RxRx3-core: Benchmarking drug-target interactions in High-Content Microscopy
RxRx3-core provides a standardized benchmark for researchers in computational biology and drug discovery, though the contribution is incremental: it focuses on data curation rather than new methods.
The authors tackled the lack of accessible datasets and benchmarks for representation learning in high-content screening microscopy by introducing RxRx3-core, a curated 18GB subset of the RxRx3 dataset with 222,601 images, enabling zero-shot drug-target interaction prediction.
High Content Screening (HCS) microscopy datasets have transformed the ability to profile cellular responses to genetic and chemical perturbations, enabling cell-based inference of drug-target interactions (DTI). However, the adoption of representation learning methods for HCS data has been hindered by the lack of accessible datasets and robust benchmarks. To address this gap, we present RxRx3-core, a curated and compressed subset of the RxRx3 dataset, and an associated DTI benchmarking task. At just 18GB, RxRx3-core significantly reduces the size barrier associated with large-scale HCS datasets while preserving critical data necessary for benchmarking representation learning models against a zero-shot DTI prediction task. RxRx3-core includes 222,601 microscopy images spanning 736 CRISPR knockouts and 1,674 compounds at 8 concentrations. RxRx3-core is available on HuggingFace and Polaris, along with pre-trained embeddings and benchmarking code, ensuring accessibility for the research community. By providing a compact dataset and robust benchmarks, we aim to accelerate innovation in representation learning methods for HCS data and support the discovery of novel biological insights.
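The abstract does not spell out how the zero-shot DTI task is scored, so the following is only a minimal sketch of the general idea: if a compound and a gene knockout produce similar image embeddings, the gene is a candidate target for that compound. The sketch assumes each perturbation has already been reduced to a single embedding vector and uses cosine similarity for ranking. The function names, gene labels, and toy vectors are hypothetical and are not taken from the RxRx3-core benchmarking code.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_targets(compound_emb: np.ndarray,
                 knockout_emb: np.ndarray,
                 gene_names: list[str]) -> list[list[str]]:
    """For each compound, list genes from most to least similar knockout."""
    sims = cosine_similarity_matrix(compound_emb, knockout_emb)
    order = np.argsort(-sims, axis=1)  # descending similarity per compound
    return [[gene_names[j] for j in row] for row in order]


# Toy, made-up embeddings (assumption: one vector per perturbation).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
knockout_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
compound_emb = np.array([
    [0.9, 0.1, 0.0],  # resembles the GENE_A knockout
    [0.0, 0.2, 0.8],  # resembles the GENE_C knockout
])

rankings = rank_targets(compound_emb, knockout_emb, gene_names)
for i, ranked in enumerate(rankings):
    print(f"compound {i}: {ranked}")
```

No model is trained on known compound-target pairs here, which is what makes the setting zero-shot: the ranking comes entirely from the geometry of the embedding space. A benchmark could then compare such rankings against annotated targets, but the specific metric is left open in this sketch.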