Nathaniel K. Brown

2 Papers

76.7 · DS · Mar 23
Optimal-Time Move Structure Balancing and LCP Array Computation from the RLBWT

Nathaniel K. Brown, Ahsan Sanaullah, Shaojie Zhang et al.

On repetitive text collections of size $n$, the Burrows-Wheeler Transform (BWT) tends to have relatively few runs $r$ in its run-length encoded form (RLBWT). This motivates many RLBWT-related algorithms and data structures designed to work in compressed $O(r)$ space. These approaches often use the RLBWT-derived permutations LF, FL, $ϕ$, and $ϕ^{-1}$, which can be represented using a move structure to obtain optimal $O(1)$ time for each permutation step in $O(r)$ space. The permutations are then used to construct compressed-space text indexes supporting efficient pattern matching queries. However, move structure construction in $O(r)$ space requires an $O(r \log r)$-time balancing stage. The longest common prefix (LCP) array of a text collection is used to support pattern matching queries and data structure construction. Recently, it was shown how to compute the LCP array in $O(n + r \log r)$ time and $O(r)$ additional space from an RLBWT; the bottleneck there is again the $O(r \log r)$-time move structure balancing stage. In this paper, we describe an optimal $O(r)$-time and $O(r)$-space algorithm to balance a move structure. This result is then applied to LCP construction from an RLBWT to obtain an optimal $O(n)$-time algorithm using $O(r)$ space in addition to the output, which implies an optimal-time algorithm for LCP array enumeration in compressed $O(r)$ space.
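To make the move structure mentioned in this abstract concrete, here is a minimal Python sketch of the general idea: a permutation is stored as its maximal contiguously permuted intervals, and one step maps a position together with its interval index. The names `build_move_structure` and `move` are hypothetical, the structure is built from an explicit permutation rather than from an RLBWT, and no balancing is performed, so this illustrates the query only and is not the paper's algorithm.

```python
from bisect import bisect_right

def build_move_structure(perm):
    """Store perm as maximal intervals on which perm[i + 1] == perm[i] + 1."""
    n = len(perm)
    # Start of every maximal contiguously permuted interval.
    starts = [i for i in range(n) if i == 0 or perm[i] != perm[i - 1] + 1]
    # Image of each interval start under the permutation.
    images = [perm[s] for s in starts]
    # Index of the interval that contains each image.
    dest = [bisect_right(starts, img) - 1 for img in images]
    return starts, images, dest, n

def move(ms, i, k):
    """Given position i inside interval k, return (perm[i], its interval index)."""
    starts, images, dest, n = ms
    j = images[k] + (i - starts[k])  # same offset inside the image interval
    k2 = dest[k]
    # Fast-forward to the interval that actually contains j. Without
    # balancing (or another splitting scheme) this loop can take many steps.
    while k2 + 1 < len(starts) and starts[k2 + 1] <= j:
        k2 += 1
    return j, k2

# Toy permutation with three intervals: [0,3) -> 3.., [3,6) -> 0.., [6,7) -> 6.
perm = [3, 4, 5, 0, 1, 2, 6]
ms = build_move_structure(perm)
```

The number of stored intervals plays the role of $r$ here; the fast-forward loop is the part whose length the balancing stage is meant to bound.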

44.5 · DS · Apr 15
Bounding the Average Move Structure Query for Faster and Smaller RLBWT Permutations

Nathaniel K. Brown, Ben Langmead

The move structure represents permutations with long contiguously permuted intervals in compressed space with optimal query time. Move structures have become an important feature of compressed text indexes using space proportional to the number of Burrows-Wheeler Transform (BWT) runs, often applied in genomics. This is thanks not only to theoretical improvements over past approaches, but also to great cache efficiency and average-case query time in practice, even without the worst-case guarantees provided by the interval-splitting balancing of the original result. In this paper, we show that an even simpler type of splitting, length capping by truncating long intervals, bounds the average move structure query time to optimal while obtaining a better construction time than the traditional approach. This also proves constant query time when amortized over a full traversal of a single-cycle permutation from an arbitrary starting position. Such a scheme has surprising benefits both in theory and practice. For a move structure with $r$ runs over a domain of size $n$, we replace all $O(r \log n)$-bit components to reduce the overall representation by $O(r \log r)$ bits. The worst-case query time is also improved to $O(\log \frac{n}{r})$ without balancing. An $O(r)$-time and $O(r)$-space construction lets us apply the method to run-length encoded BWT (RLBWT) permutations such as LF and $ϕ$ to obtain optimal-time algorithms for BWT inversion and suffix array (SA) enumeration in $O(r)$ additional working space. Finally, we introduce the Orbit library for move structure support, and use it to evaluate our splitting approach. Experiments find that length-capping construction is faster and uses less memory than balancing, with faster queries. We also see a space reduction in practice, with at least a $\sim 40\%$ disk size decrease for LF across large repetitive genomic collections.
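As a rough illustration of the length-capping idea in this abstract, the Python sketch below truncates every interval longer than a chosen cap into pieces of at most that length. The function name `cap_intervals` and the `cap` parameter are hypothetical; the abstract does not state how the cap is chosen, and this is not the Orbit library's API.

```python
def cap_intervals(starts, images, n, cap):
    """Split every interval longer than cap into pieces of length <= cap.

    starts[k] is the first position of interval k, images[k] is where that
    position is mapped, and n is the size of the permuted domain.
    """
    new_starts, new_images = [], []
    for k, s in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else n
        # Each piece keeps its offset, so the permutation is unchanged.
        for off in range(0, end - s, cap):
            new_starts.append(s + off)
            new_images.append(images[k] + off)
    return new_starts, new_images

# Toy input: [0,10) is mapped to [2,12) and [10,12) is mapped to [0,2).
capped_starts, capped_images = cap_intervals([0, 10], [2, 0], n=12, cap=4)
# capped_starts == [0, 4, 8, 10], capped_images == [2, 6, 10, 0]
```

Because each interval is handled independently in one left-to-right pass, this kind of splitting needs no priority structure, which is consistent with the abstract's claim that capping is simpler to construct than balancing.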