Evaluating unsupervised disentangled representation learning for genomic discovery and disease risk prediction
This work aims to improve genomic discovery and disease risk prediction for clinical and genetic research. Its contribution is incremental, since it builds on existing VAE-based approaches.
The study compares unsupervised disentangled representation learning methods (autoencoders, VAE, beta-VAE, and FactorVAE) for genetic association studies and disease risk prediction, using spirograms from UK Biobank. The results showed that FactorVAE and beta-VAE yielded improvements in the number of genome-wide significant loci, heritability, and the performance of polygenic risk scores for asthma and chronic obstructive pulmonary disease, compared to standard VAE or non-variational autoencoders.
High-dimensional clinical data have become invaluable resources for genetic studies, owing to their accessibility in biobank-scale datasets and the development of high-performance modeling techniques, especially those using deep learning. Recent work has shown that low-dimensional embeddings of these clinical data learned by variational autoencoders (VAEs) can be used for genome-wide association studies and polygenic risk prediction. In this work, we consider multiple unsupervised learning methods for learning disentangled representations, namely autoencoders, VAE, beta-VAE, and FactorVAE, in the context of genetic association studies. Using spirograms from UK Biobank as a running example, we observed improvements in the number of genome-wide significant loci, heritability, and performance of polygenic risk scores for asthma and chronic obstructive pulmonary disease when using FactorVAE or beta-VAE, compared to standard VAE or non-variational autoencoders. FactorVAEs performed effectively across multiple values of the regularization hyperparameter, while beta-VAEs were much more sensitive to the hyperparameter values.
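To make the role of the regularization hyperparameter concrete, the following is a minimal sketch of a beta-VAE style objective. It is not the implementation used in this work. It assumes a diagonal Gaussian encoder, a standard normal prior, and a squared-error reconstruction term, and the function names (`gaussian_kl`, `beta_vae_loss`) are hypothetical. Setting `beta=1` recovers the standard VAE objective under these assumptions. FactorVAE instead adds a separate penalty on the total correlation of the latent code, which is typically estimated with a discriminator and is not shown here.

```python
import numpy as np


def gaussian_kl(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), per sample.

    mu and log_var are arrays of shape (batch, latent_dim).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)


def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Mean over the batch of: squared reconstruction error + beta * KL.

    The squared-error term is an assumption made for illustration;
    other likelihoods would change only the reconstruction term.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)  # per-sample reconstruction error
    kl = gaussian_kl(mu, log_var)               # per-sample KL to the prior
    return float(np.mean(recon + beta * kl))


if __name__ == "__main__":
    # Toy inputs standing in for a batch of two "spirograms" of length 3.
    x = np.ones((2, 3))
    x_recon = np.zeros((2, 3))
    mu = np.ones((2, 2))
    log_var = np.zeros((2, 2))
    for beta in (1.0, 4.0):
        print(beta, beta_vae_loss(x, x_recon, mu, log_var, beta=beta))
```

In this sketch a larger `beta` weights the KL term more heavily relative to reconstruction, which is the trade-off that the hyperparameter sensitivity described above refers to.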