Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling
This work addresses batch effects in high-content imaging for drug discovery and offers a scalable solution for large-scale Cell Painting data. The contribution is incremental: it builds on existing batch-correction methods and mainly adds efficiency improvements.
The paper tackles batch effects in Cell Painting data, which obscure biological signal, by introducing BALANS, a scalable batch-correction method that uses batch-dependent kernels and adaptive sampling to align samples across batches efficiently. The authors prove that the sampling strategy is order-optimal in sample complexity and show that the method runs in nearly linear time. Experiments indicate that BALANS scales to large datasets and improves runtime over existing methods while maintaining correction quality.
Cell Painting is a microscopy-based, high-content imaging assay that produces rich morphological profiles of cells and can support drug discovery by quantifying cellular responses to chemical perturbations. At scale, however, Cell Painting data is strongly affected by batch effects arising from differences in laboratories, instruments, and protocols, which can obscure biological signal. We present BALANS (Batch Alignment via Local Affinities and Subsampling), a scalable batch-correction method that aligns samples across batches by constructing a smoothed affinity matrix from pairwise distances. Given $n$ data points, BALANS builds a sparse affinity matrix $A \in \mathbb{R}^{n \times n}$ using two ideas. (i) For points $i$ and $j$, it sets a local scale using the distance from $i$ to its $k$-th nearest neighbor within the batch of $j$, then computes $A_{ij}$ via a Gaussian kernel calibrated by these batch-aware local scales. (ii) Rather than forming all $n^2$ entries, BALANS uses an adaptive sampling procedure that prioritizes rows with low cumulative neighbor coverage and retains only the strongest affinities per row, yielding a sparse but informative approximation of $A$. We prove that this sampling strategy is order-optimal in sample complexity and provides an approximation guarantee, and we show that BALANS runs in nearly linear time in $n$. Experiments on diverse real-world Cell Painting datasets and controlled large-scale synthetic benchmarks demonstrate that BALANS scales to large collections while improving runtime over native implementations of widely used batch-correction methods, without sacrificing correction quality.
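To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the exact kernel formula or sampling rule, so several details are assumptions: the kernel form $A_{ij} = \exp(-d_{ij}^2 / (\sigma_i^{b(j)} \sigma_j^{b(i)}))$, where $\sigma_i^{b}$ is the distance from $i$ to its $k$-th nearest neighbor in batch $b$; the sampling weight $1 / (1 + \text{coverage})$; and the function names `batch_aware_affinity` and `adaptive_row_sample`, which are hypothetical. The sketch also forms dense distances for readability, which a scalable version would avoid.

```python
import numpy as np


def batch_aware_affinity(X, batches, k):
    """Dense affinity matrix with batch-aware local scales (illustrative only).

    Assumed kernel: A[i, j] = exp(-d_ij^2 / (s_i(batch of j) * s_j(batch of i))),
    where s_i(b) is the distance from i to its k-th nearest neighbor in batch b.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    labels = np.unique(batches)
    batch_idx = np.searchsorted(labels, batches)  # batch column index per point

    # Dense pairwise Euclidean distances: O(n^2), fine for a small example only.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    scales = np.empty((X.shape[0], labels.size))
    for col, b in enumerate(labels):
        members = np.flatnonzero(batches == b)
        if members.size <= k:
            raise ValueError("each batch needs more than k points")
        Db = D[:, members].copy()
        # Ignore each point's zero distance to itself inside its own batch.
        Db[members, np.arange(members.size)] = np.inf
        scales[:, col] = np.sort(Db, axis=1)[:, k - 1]

    S = scales[:, batch_idx]  # S[i, j] = scale of i measured in the batch of j
    return np.exp(-(D ** 2) / (S * S.T))


def adaptive_row_sample(A, n_rows, top_m, seed=0):
    """Sample rows, favoring poorly covered ones, and keep top_m entries per row.

    Assumed rule: a row is drawn with probability proportional to
    1 / (1 + coverage), where coverage accumulates the affinities that
    previously sampled rows have assigned to that point.
    Returns {row index: (column indices, affinity values)}.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    coverage = np.zeros(n)
    available = np.ones(n, dtype=bool)
    rows = {}
    for _ in range(n_rows):
        weights = available / (1.0 + coverage)
        i = int(rng.choice(n, p=weights / weights.sum()))
        available[i] = False
        row = A[i].copy()
        row[i] = 0.0  # do not count the trivial self-affinity
        cols = np.argpartition(row, -top_m)[-top_m:]  # strongest affinities
        rows[i] = (cols, row[cols])
        coverage[cols] += row[cols]
    return rows
```

In this sketch the full matrix is built first and then subsampled, purely so the two steps can be read separately. The point of the method described above is to avoid forming all $n^2$ entries, so a practical version would compute only the sampled rows, for example with a per-batch nearest-neighbor index.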