23.0DSMay 5
Faster Iterative $ϕ$ Queries on the Positional BWTPaola Bonizzoni, Travis Gagie, Younan Gao
The Positional Burrows-Wheeler Transform (PBWT) is a fundamental data structure for the efficient representation and analysis of large-scale haplotype panels. For a panel of $h$ sequences $\{S_1, \dots, S_h\}$ over $m$ sites, a key operation is the $ϕ_j(i)$ query, which returns the haplotype index immediately preceding $S_i$ in co-lexicographic order at site $j$. Efficient support for $k$ iterative queries $ϕ^1, \dots, ϕ^k$ is essential for haplotype matching and variation analysis. In this work, we introduce a simple and novel decomposition scheme that decomposes each haplotype row into sub-intervals, called refined segments, within which a haplotype's co-lexicographic predecessor for the sites remains unchanged. We show that refined segments satisfy two key properties: (i) each segment $[b,e]$ associated with $S_i$ overlaps with at most a constant number of segments of $S_{ϕ_e(i)}$, and (ii) the total number of segments is bounded by $O(\tilde{r} + h)$, where $\tilde{r}$ denotes the number of runs in the PBWT. Building on this decomposition, we present two space-time tradeoffs for supporting $k$ iterative $ϕ$ queries: (i) a structure using $O((\tilde{r} + h)\log n)$ bits of space that answers $k$ iterative queries in $O(\log \log_w \min(m,h) + k)$ time, where $n = m \cdot h$, and (ii) a more compact structure using $O(\tilde{r} \log h + h \log n)$ bits of space that supports queries in $O(k \log \log_w h)$ time. Prior to our work, supporting these queries required $O((\tilde{r} + h)\log n)$ bits of space and $O(k \cdot \log \log_w m)$ time. Our second tradeoff is expected to be effective in practice for modern genomic datasets, where the number $h$ of haplotypes is typically much smaller than the number $m$ of sites.
43.2DSMay 5
Constructing Suffixient Arrays RevisitedPaola Bonizzoni, Younan Gao, Brian Riccardi
Recently, Cenzato et al.\ proposed a new text index, called the \emph{suffixient array}, which is a subset of the suffix array and supports locating a single pattern occurrence or finding its maximal exact matches (MEMs), assuming random access to the input text $T[1..n]$ is available. They show that, given the suffix array, the longest common prefix array, and the Burrows--Wheeler transform (BWT) of the reverse of $T[1..n]$ over an alphabet $\{1,\ldots,σ\}$, a suffixient array can be constructed in linear time. However, their construction algorithms require multiple scans of these arrays. When restricted to a single pass over the arrays, they present an alternative construction algorithm running in $O(n + \overline{r} \log σ)$ time, where $\overline{r}$ is the number of runs in the BWT of the reversed text. In this paper, we present a new one-pass algorithm that constructs a suffixient array in linear time under the standard RAM model.
GNFeb 28, 2022
Numeric Lyndon-based feature embedding of sequencing reads for machine learning approachesPaola Bonizzoni, Matteo Costantini, Clelia De Felice et al.
Feature embedding methods have been proposed in literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as basis for the definition of novels representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from it, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign them to the most likely gene from which they were originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.