QM, LG | Apr 29, 2024

Leak Proof CMap; a framework for training and evaluation of cell line agnostic L1000 similarity methods

Steven Shave, Richard Kasprowicz, Abdullah M. Athar, Denise Vlachou, Neil O. Carragher, Cuong Q. Nguyen

arXiv:2404.18960v1 | 1.2 | h-index: 12 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a critical gap in drug discovery and personalized medicine by providing a standardized framework for unbiased evaluation of phenotypic similarity methods, enabling better-informed decisions in high-throughput screening and novel cell line applications.

The authors address the lack of standardized benchmarks for evaluating machine learning methods that measure phenotypic similarity from L1000 data. They developed 'Leak Proof CMap', which uses carefully crafted data splits to prevent data leakage, and demonstrated it by benchmarking various similarity methods in three performance areas: compactness, distinctness, and uniqueness.

The Connectivity Map (CMap) is a large, publicly available database of cellular transcriptomic responses to chemical and genetic perturbations, built using a standardized acquisition protocol known as the L1000 technique. Databases such as CMap provide an exciting opportunity to enrich drug discovery efforts, providing a 'known' phenotypic landscape to explore and enabling the development of state-of-the-art techniques for enhanced information extraction and better-informed decisions. Whilst multiple methods for measuring phenotypic similarity and interrogating profiles have been developed, the field is severely lacking standardized benchmarks that use appropriate data splitting for training and unbiased evaluation of machine learning methods. To address this, we have developed 'Leak Proof CMap' and exemplified its application to a set of common transcriptomic and generic phenotypic similarity methods, along with an exemplar triplet loss-based method. Benchmarking in three critical performance areas (compactness, distinctness, and uniqueness) is conducted using carefully crafted data splits, ensuring that no similar cell lines, and no treatments with shared or closely matching responses or mechanisms of action, appear in more than one of the training, validation, and test sets. This enables testing of models with unseen samples, akin to exploring treatments with novel modes of action in novel patient-derived cell lines. With a carefully crafted benchmark and data splitting regime in place, the tooling now exists to create performant phenotypic similarity methods for use in personalized medicine (novel cell lines) and to better augment high-throughput phenotypic screening technologies with the L1000 transcriptomic technology.
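The splitting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each profile already carries a cell-line group label and a mechanism-of-action group label (the hypothetical keys `cell_line` and `moa`), assigns every group to exactly one split, and keeps a profile only when both of its groups land in the same split. The function names, the default fractions, and the choice to discard profiles that straddle two splits are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_groups(groups, fractions, seed):
    """Shuffle group labels and assign each label to exactly one split."""
    rng = random.Random(seed)
    ordered = sorted(groups)  # sort first so the shuffle is reproducible
    rng.shuffle(ordered)
    names = list(fractions)
    assignment = {}
    start, cumulative = 0, 0.0
    for i, name in enumerate(names):
        cumulative += fractions[name]
        # The last split takes whatever remains, so no group is left out.
        end = len(ordered) if i == len(names) - 1 else round(cumulative * len(ordered))
        for group in ordered[start:end]:
            assignment[group] = name
        start = end
    return assignment


def leak_free_split(samples, fractions=None, seed=0):
    """Split samples so no cell line or MoA group appears in two splits.

    `samples` is a list of dicts with hypothetical keys "cell_line" and "moa".
    Samples whose cell line and MoA fall into different splits are dropped.
    """
    fractions = fractions or {"train": 0.6, "validation": 0.2, "test": 0.2}
    cell_split = assign_groups({s["cell_line"] for s in samples}, fractions, seed)
    moa_split = assign_groups({s["moa"] for s in samples}, fractions, seed + 1)
    splits = defaultdict(list)
    for s in samples:
        if cell_split[s["cell_line"]] == moa_split[s["moa"]]:
            splits[cell_split[s["cell_line"]]].append(s)
    return dict(splits)


if __name__ == "__main__":
    # Toy data: every combination of 10 cell lines and 10 MoA groups.
    toy = [{"cell_line": f"C{i}", "moa": f"M{j}"} for i in range(10) for j in range(10)]
    for name, members in leak_free_split(toy, seed=1).items():
        print(name, len(members))
```

The cost of this simple scheme is that profiles pairing a training cell line with a test-set mechanism of action are discarded rather than reassigned. It also treats the group labels as given; deciding which cell lines or treatments count as "similar" enough to share a group is a separate step that this sketch does not attempt.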
