Transformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis
This work addresses the challenge of robust gene expression modeling for cancer prognosis and offers a scalable tool with translational potential. The contribution is incremental, however, because it adapts existing transformer methods to a new domain.
The paper applies transformer models to gene expression analysis by developing GexBERT, a transformer-based autoencoder. GexBERT achieves state-of-the-art accuracy in pan-cancer classification from limited gene subsets, improves survival prediction, and outperforms conventional imputation methods under high missingness.
Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
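The masking and restoration objective mentioned above can be illustrated with a minimal sketch. This is not the GexBERT implementation: the function names (`mask_profile`, `restoration_loss`, `mean_restorer`), the 15% masking fraction, and the squared-error loss are all assumptions made for illustration, and a trivial mean-fill function stands in for the transformer. The sketch only shows the general shape of the idea: hide some gene values, predict them from the visible ones, and score the prediction on the hidden positions only.

```python
import random

def mask_profile(profile, mask_frac, rng):
    """Hide a random subset of genes; hidden entries become None."""
    n_mask = max(1, round(mask_frac * len(profile)))
    masked_idx = sorted(rng.sample(range(len(profile)), n_mask))
    hidden = set(masked_idx)
    visible = [None if i in hidden else v for i, v in enumerate(profile)]
    return visible, masked_idx

def restoration_loss(profile, predicted, masked_idx):
    """Mean squared error computed on the masked positions only."""
    return sum((predicted[i] - profile[i]) ** 2 for i in masked_idx) / len(masked_idx)

def mean_restorer(visible):
    """Hypothetical stand-in for a trained model: fill hidden genes
    with the mean of the visible ones."""
    observed = [v for v in visible if v is not None]
    fill = sum(observed) / len(observed) if observed else 0.0
    return [fill if v is None else v for v in visible]

# Toy expression profile of 20 genes (synthetic values, not real data).
rng = random.Random(0)
profile = [rng.gauss(0.0, 1.0) for _ in range(20)]

visible, masked_idx = mask_profile(profile, 0.15, rng)
predicted = mean_restorer(visible)
loss = restoration_loss(profile, predicted, masked_idx)
```

In a real pretraining setup, `mean_restorer` would be replaced by a model whose parameters are updated to reduce this loss, so that predictions for hidden genes come to depend on the co-expression context supplied by the visible genes.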