Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence
This work addresses the challenge of tracking SARS-CoV-2 mutations for scientists involved in vaccine development and public health, though it appears incremental as it builds on existing deep learning techniques.
The study addresses the problem of identifying SARS-CoV-2 divergence by proposing a neural network model that classifies clades directly from the amino acid sequences of spike proteins, with the potential to offer a computationally efficient alternative to current homology-based methods.
SARS-CoV-2 is an RNA virus affecting the upper respiratory system that had caused over 3 million deaths and infected over 150 million people worldwide as of May 2021. With thousands of strains sequenced to date, SARS-CoV-2 mutations pose significant challenges for scientists trying to keep pace in vaccine development and public health measures. An efficient method for identifying the divergence of patient-derived lab samples would therefore greatly aid the documentation of SARS-CoV-2 genomics. In this study, we propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins directly and classify the corresponding clades. We also compare our model's performance with that of Bidirectional Encoder Representations from Transformers (BERT) pre-trained on a protein database. Our approach has the potential to provide a more computationally efficient alternative to current homology-based intra-species differentiation.
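To make "directly take in amino acid sequences" concrete, the following is a minimal sketch of one common way to turn a spike protein sequence into fixed-length numeric input for a convolutional or recurrent layer. It is not the preprocessing used in this study: the 20-letter alphabet, the padding and unknown-residue indices, the `max_len` parameter, and the function names `encode_sequence` and `one_hot` are all assumptions introduced for illustration.

```python
from typing import Dict, List

# Assumption: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_INDEX = 0  # hypothetical index reserved for padding
UNK_INDEX = 1  # hypothetical index for ambiguous residues such as 'X'
VOCAB: Dict[str, int] = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB_SIZE = len(AMINO_ACIDS) + 2


def encode_sequence(seq: str, max_len: int) -> List[int]:
    """Map residues to integer indices, then truncate or right-pad to max_len."""
    indices = [VOCAB.get(residue, UNK_INDEX) for residue in seq.upper()[:max_len]]
    return indices + [PAD_INDEX] * (max_len - len(indices))


def one_hot(indices: List[int]) -> List[List[int]]:
    """Expand each index into an indicator vector of length VOCAB_SIZE."""
    return [[1 if j == idx else 0 for j in range(VOCAB_SIZE)] for idx in indices]


if __name__ == "__main__":
    # Short illustrative fragment, not a full spike protein sequence.
    fragment = "MFVFLVLLPLVSSQ"
    encoded = encode_sequence(fragment, max_len=16)
    matrix = one_hot(encoded)
    print(encoded)
    print(len(matrix), "positions x", len(matrix[0]), "channels")
```

Under these assumptions, every sequence becomes a `max_len` by `VOCAB_SIZE` matrix (or, equivalently, a list of indices for an embedding layer), which is the shape of input that convolutional and recurrent units typically consume. No alignment against a reference genome is needed at this step, which is where the potential efficiency gain over homology-based methods would come from.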