LG | AI | Apr 14, 2022

SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Tiffany T. W. Mak, Xiaopu Zhou, Nancy Y. Ip, Pascale Fung

arXiv:2204.06699v152.4640 citations | h-index: 79 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of applying self-supervised learning to genetic variations, which are crucial for genome-wide association studies. It represents an incremental advance over methods that model only haploid sequences.

The paper tackles the problem of understanding single nucleotide polymorphisms (SNPs) for genome-wide association studies by introducing SNP2Vec, a scalable self-supervised pre-training approach. It significantly outperforms existing polygenic risk score methods and other baselines in predicting Alzheimer's disease risk in a Chinese cohort.

Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics have also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability to understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which are crucial for genome-wide association studies. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNPs. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences. We release our code and dataset at https://github.com/HLTCHKUST/snp2vec.
