GN, LG | Apr 6, 2023

ViralVectors: Compact and Scalable Alignment-free Virome Feature Generation

arXiv:2304.02891v2 | 8 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable methods to handle rapidly growing viral genomic data for surveillance and decision-making. It represents an incremental advance: it applies an existing technique (minimizers) in a new way.

The authors tackle the problem of processing large and heterogeneous virome sequencing data by proposing ViralVectors, a compact feature generation method based on minimizers. According to the abstract, it outperforms current benchmarks in most classification and clustering tasks, on datasets that include 2.5M SARS-CoV-2 spike sequences.

The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a method for generating compact feature vectors from virome sequencing data that allows effective downstream analysis. The method is based on minimizers, a type of lightweight "signature" of a sequence traditionally used in assembly and read mapping. To our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.
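To make the idea of a minimizer-based feature vector concrete, here is a minimal Python sketch. It is not taken from the paper's code: the function names, the parameters k and m, and the choice to take the lexicographically smallest m-mer over each k-mer and its reversal are all assumptions made for illustration. It shows how sequences of any length can be mapped to a fixed-length count vector without alignment.

```python
from collections import Counter
from itertools import product


def minimizers(seq, k=9, m=3):
    """Return one minimizer per k-mer of seq.

    Assumption for this sketch: the minimizer of a k-mer is the
    lexicographically smallest m-mer found in the k-mer or in its reversal.
    """
    result = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        candidates = [
            s[j:j + m]
            for s in (kmer, kmer[::-1])
            for j in range(k - m + 1)
        ]
        result.append(min(candidates))
    return result


def feature_vector(seq, alphabet="ACGT", k=9, m=3):
    """Count minimizers into a vector of length len(alphabet) ** m."""
    counts = Counter(minimizers(seq, k, m))
    # One slot per possible m-mer, in lexicographic order of the alphabet.
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=m)]


if __name__ == "__main__":
    vec = feature_vector("ACGTAC", k=4, m=2)
    print(len(vec), vec)
```

The vector length depends only on the alphabet size and m, not on the input length, which is what would let aligned sequences, unaligned sequences, and raw reads share one representation. For amino acid data the same sketch would be called with a 20-letter alphabet string.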

Code Implementations: 1 repo