Apr 24, 2023

Virus2Vec: Viral Sequence Classification Using Machine Learning

arXiv:2304.12328v1
5 citations
h-index: 20
Originality: Synthesis-oriented
AI Analysis

This work addresses a domain-specific problem for epidemiologists and medical professionals by improving viral host classification, though it appears incremental as it builds on existing techniques like minimizers and position weight matrices.

The authors tackled the problem of predicting viral hosts from nucleotide or amino acid sequences by proposing Virus2Vec, a feature-vector representation that eliminates the need for sequence alignment. Their method outperformed baseline and state-of-the-art methods in predictive accuracy on real-world data from Coronaviridae and rabies virus sequences.

Understanding the host specificity of different virus families sheds light on the origin of zoonotic pathogens in humans, such as SARS-CoV-2 and rabies. It enables epidemiologists, medical professionals, and policymakers to curb existing epidemics and prevent future ones promptly. In the family Coronaviridae (of which SARS-CoV-2 is a member), it is well known that the spike protein is the point of contact between the virus and the host cell membrane. For rabies, the two traditional mammalian orders Carnivora (carnivores) and Chiroptera (bats) are recognized as responsible for maintaining and spreading the rabies lyssavirus (RABV). We propose Virus2Vec, a feature-vector representation for viral (nucleotide or amino acid) sequences that enables vector-space-based machine learning models to identify viral hosts. Virus2Vec generates numerical feature vectors for unaligned sequences, allowing us to forgo the computationally expensive sequence alignment step in the pipeline. Virus2Vec leverages both the minimizer and the position weight matrix (PWM) to generate compact feature vectors. Using several classifiers, we empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and on rabies virus sequence data to predict the host (identifying the reservoirs of infection). Our results demonstrate that Virus2Vec achieves higher predictive accuracy than baseline and state-of-the-art methods.
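
The abstract names two ingredients, minimizers and a position weight matrix, without giving details. The Python sketch below illustrates how the two could be combined into a fixed-length vector for an unaligned sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameters `k`, `m`, and `pseudocount`, the uniform background, and the choice to index the vector by every possible m-mer are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def minimizer(kmer, m):
    """Return the lexicographically smallest m-mer inside one k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizers(seq, k, m):
    """Slide a window of length k over seq; keep one minimizer per window."""
    return [minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1)]


def position_weight_matrix(mers, alphabet, pseudocount=1.0):
    """Log-odds PWM built from equal-length m-mers.

    Assumption: uniform background and additive pseudocounts.
    """
    background = 1.0 / len(alphabet)
    total = len(mers) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(mers[0])):
        counts = Counter(mer[pos] for mer in mers)
        pwm.append({
            ch: math.log2(((counts[ch] + pseudocount) / total) / background)
            for ch in alphabet
        })
    return pwm


def pwm_score(mer, pwm):
    """Sum the per-position weights; unknown symbols contribute 0."""
    return sum(pwm[pos].get(ch, 0.0) for pos, ch in enumerate(mer))


def feature_vector(seq, k, m, alphabet):
    """Fixed-length vector with one slot per possible m-mer (assumption).

    Slots of m-mers that occur as minimizers hold their PWM score;
    all other slots stay 0.
    """
    mers = minimizers(seq, k, m)
    pwm = position_weight_matrix(mers, alphabet)
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=m))}
    vec = [0.0] * len(index)
    for mer in set(mers):
        if mer in index:  # skip m-mers containing symbols outside the alphabet
            vec[index[mer]] = pwm_score(mer, pwm)
    return vec


if __name__ == "__main__":
    print(minimizers("GATTACA", k=5, m=3))  # ['ATT', 'ATT', 'ACA']
    print(len(feature_vector("GATTACA", k=5, m=3, alphabet="ACGT")))  # 64
```

Because the vector length depends only on the alphabet size and `m`, sequences of different lengths map to vectors of the same dimension, which is what lets a standard classifier consume them without alignment. For amino acid sequences the same functions apply with a 20-letter alphabet, although the vector then grows as 20 to the power `m`.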
