Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs
This work addresses the need for scalable dimensionality reduction techniques in single-cell genomics that disentangle biological variation from confounders. The contribution is incremental in that it extends an existing method, the Gaussian process latent variable model.
The authors scale probabilistic non-linear dimensionality reduction for single-cell RNA-seq data to massive datasets while accounting for technical and biological confounders. They report a 9x reduction in training time and integrate data across 130 individuals, capturing interpretable signatures of infection.
Single-cell RNA-seq datasets are growing in size and complexity, enabling the study of cellular composition changes in various biological and clinical contexts. Scalable dimensionality reduction techniques are needed to disentangle biological variation in these datasets while accounting for technical and biological confounders. In this work, we extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets while explicitly accounting for technical and biological confounders. The key idea is to use an augmented kernel that preserves the factorisability of the lower bound, allowing for fast stochastic variational inference. We demonstrate its ability to reconstruct latent signatures of innate immunity recovered in Kumasaka et al. (2021) with 9x lower training time. We further analyse a COVID dataset and demonstrate, across a cohort of 130 individuals, that this framework enables data integration while capturing interpretable signatures of infection. Specifically, we explore COVID severity as a latent dimension to refine patient stratification and capture disease-specific gene expression.
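The augmented-kernel idea can be illustrated with a minimal sketch. The abstract does not specify the kernel's form, so the construction below is an assumption made purely for illustration: an RBF kernel on the latent coordinates is added to a linear kernel on known covariates (here, a one-hot batch label). All function names, parameter values, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on latent coordinates."""
    sq_dist = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def linear_kernel(C1, C2, variance=1.0):
    """Linear kernel on known covariates (e.g. one-hot batch labels)."""
    return variance * C1 @ C2.T

def augmented_kernel(X1, C1, X2, C2):
    """Assumed additive form: latent (biological) part plus covariate part."""
    return rbf_kernel(X1, X2) + linear_kernel(C1, C2)

# Toy data: 6 cells, 2 latent dimensions, 3 batches (all values hypothetical).
rng = np.random.default_rng(0)
n_cells, latent_dim, n_batches = 6, 2, 3
X = rng.normal(size=(n_cells, latent_dim))        # latent positions
batch = rng.integers(0, n_batches, size=n_cells)  # batch label per cell
C = np.eye(n_batches)[batch]                      # one-hot covariate matrix

K = augmented_kernel(X, C, X, C)                  # (n_cells, n_cells) Gram matrix
```

Because a sum of two valid kernels is itself a valid kernel, `K` is symmetric and positive semi-definite. In this sketch, covariate effects enter through the kernel rather than through a separate model component, which is one plausible way for a variational lower bound to remain a sum over cells and so admit mini-batch training.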