GN · AI · LG · Aug 2, 2025

A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification

arXiv:2508.02743v1 · 1 citation
Originality: Incremental advance
AI Analysis

This work addresses high dimensionality and limited sample sizes in pan-cancer classification, a task that informs tumor subtyping and therapy selection. It is an incremental advance: a new method applied to a known bottleneck.

The authors propose a cVAE-augmented deep learning framework that generates synthetic RNA-Seq samples for pan-cancer classification. It reaches ~98% accuracy on a held-out test set, substantially outperforming a classifier trained on the original data alone.

Pan-cancer classification using transcriptomic (RNA-Seq) data can inform tumor subtyping and therapy selection, but is challenging due to extremely high dimensionality and limited sample sizes. In this study, we propose a novel deep learning framework that uses a class-conditional variational autoencoder (cVAE) to augment training data for pan-cancer gene expression classification. Using 801 tumor RNA-Seq samples spanning 5 cancer types from The Cancer Genome Atlas (TCGA), we first perform feature selection to reduce 20,531 gene expression features to the 500 most variably expressed genes. A cVAE is then trained on this data to learn a latent representation of gene expression conditioned on cancer type, enabling the generation of synthetic gene expression samples for each tumor class. We augment the training set with these cVAE-generated samples (doubling the dataset size) to mitigate overfitting and class imbalance. A two-layer multilayer perceptron (MLP) classifier is subsequently trained on the augmented dataset to predict tumor type. The augmented framework achieves high classification accuracy (~98%) on a held-out test set, substantially outperforming a classifier trained on the original data alone. We present detailed experimental results, including VAE training curves, classifier performance metrics (ROC curves and confusion matrix), and architecture diagrams to illustrate the approach. The results demonstrate that cVAE-based synthetic augmentation can significantly improve pan-cancer prediction performance, especially for underrepresented cancer classes.
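
The two data-preparation steps in the abstract (keeping the most variable genes, then appending class-conditional synthetic samples) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy data sizes, and the choice of one synthetic sample per real sample in each class are all assumptions. The cVAE itself is not implemented; `placeholder_generate` is a hypothetical stand-in for sampling from a trained conditional decoder.

```python
import numpy as np

def select_top_variable_genes(X, k=500):
    """Return (X_reduced, idx): the k columns of X with the largest variance."""
    variances = X.var(axis=0)
    # argsort is ascending, so take the last k indices and reverse them
    idx = np.argsort(variances)[-k:][::-1]
    return X[:, idx], idx

def augment_per_class(X, y, generate, rng):
    """Append one synthetic sample per real sample, class by class.

    `generate(label, n, rng)` is a hypothetical stand-in for drawing n
    expression profiles from a trained cVAE decoder conditioned on `label`.
    """
    X_parts, y_parts = [X], [y]
    for label in np.unique(y):
        n = int(np.sum(y == label))
        X_parts.append(generate(label, n, rng))
        y_parts.append(np.full(n, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy data: 60 samples, 2000 "genes", 5 classes
# (far smaller than the 801 x 20,531 matrix described in the abstract)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000)) * rng.uniform(0.1, 3.0, size=2000)
y = np.repeat(np.arange(5), 12)

X_sel, gene_idx = select_top_variable_genes(X, k=500)

def placeholder_generate(label, n, rng):
    # NOT a cVAE: class mean plus Gaussian noise, used only so the sketch runs
    mean = X_sel[y == label].mean(axis=0)
    return mean + rng.normal(scale=0.1, size=(n, X_sel.shape[1]))

X_aug, y_aug = augment_per_class(X_sel, y, placeholder_generate, rng)
print(X_sel.shape, X_aug.shape)
```

In a real pipeline, `placeholder_generate` would be replaced by a call that samples a latent vector, concatenates it with the class label encoding, and passes it through the trained decoder. Generating as many synthetic samples as real ones per class doubles the dataset but preserves class proportions; reducing imbalance would require generating more samples for the smaller classes.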

