Samuel McCauley

DS
h-index11
4papers
14citations
Novelty53%
AI Score41

4 Papers

10.8DSApr 1
Space-Efficient Text Indexing with Mismatches using Function Inversion

Jackson Bibbens, Levi Borevitz, Samuel McCauley

A classic data structure problem is to preprocess a string T of length $n$ so that, given a query $q$, we can quickly find all substrings of T with Hamming distance at most $k$ from the query string. Variants of this problem have seen significant research both in theory and in practice. For a wide parameter range, the best worst-case bounds are achieved by the "CGL tree" (Cole, Gottlieb, Lewenstein 2004), which achieves query time roughly $\tilde{O}(|q| + \log^k n + \# occ)$ where $\# occ$ is the size of the output, and space ${O}(n\log^k n)$. The CGL Tree space was recently improved to $O(n \log^{k-1} n)$ (Kociumaka, Radoszewski 2026). A natural question is whether a high space bound is necessary. How efficient can we make queries when the data structure is constrained to $O(n)$ space? While this question has seen extensive research, all known results have query time with unfavorable dependence on $n$, $k$, and the alphabet $Σ$. The state of the art query time (Chan et al. 2011) is roughly $\tilde{O}(|q| + |Σ|^k \log^{k^2 + k} n + \# occ)$. We give an $O(n)$-space data structure with query time roughly $\tilde{O}(|q| + \log^{4k} n + \log^{2k} n \# occ)$, with no dependence on $|Σ|$. Even if $|Σ| = O(1)$, this is the best known query time for linear space if $k\geq 3$ unless $\# occ$ is large. Our results give a smooth tradeoff between time and space. We also give the first sublinear-space results: we give a succinct data structure using only $o(n)$ space in addition to the text itself. Our main technical idea is to apply function inversion (Fiat, Naor 2000) to the CGL tree. Combining these techniques is not immediate; in fact, we revisit the exposition of both the Fiat-Naor data structure and the CGL tree to obtain our bounds. Along the way, we obtain improved performance for both data structures, which may be of independent interest.

47.3DSApr 28
Incremental Strongly Connected Components with Predictions

Ronald Deng, Samuel McCauley, Aidin Niaparast et al.

Algorithms with predictions is a growing area that aims to leverage machine-learned predictions to design faster beyond-worst-case algorithms. In this paper, we use this framework to design a learned data structure for the incremental strongly connected components (SCC) problem. In this problem, the $n$ vertices of a graph are known a priori and the $m$ directed edges arrive over time. The goal is to efficiently maintain the strongly connected components of the graph after each insert. Our algorithm receives a possibly erroneous prediction of the edge sequence and uses it to precompute partial solutions to support fast inserts. We show that our algorithm achieves nearly optimal bounds with good predictions and its performance smoothly degrades with the prediction error. We also implement our data structure and perform experiments on real datasets. Our empirical results show that the theory is predictive of practical runtime improvements.

DSFeb 12, 2025
Incremental Approximate Single-Source Shortest Paths with Predictions

Samuel McCauley, Benjamin Moseley, Aidin Niaparast et al.

The algorithms-with-predictions framework has been used extensively to develop online algorithms with improved beyond-worst-case competitive ratios. Recently, there is growing interest in leveraging predictions for designing data structures with improved beyond-worst-case running times. In this paper, we study the fundamental data structure problem of maintaining approximate shortest paths in incremental graphs in the algorithms-with-predictions model. Given a sequence $σ$ of edges that are inserted one at a time, the goal is to maintain approximate shortest paths from the source to each vertex in the graph at each time step. Before any edges arrive, the data structure is given a prediction of the online edge sequence $\hatσ$ which is used to ``warm start'' its state. As our main result, we design a learned algorithm that maintains $(1+ε)$-approximate single-source shortest paths, which runs in $\tilde{O}(m η\log W/ε)$ time, where $W$ is the weight of the heaviest edge and $η$ is the prediction error. We show these techniques immediately extend to the all-pairs shortest-path setting as well. Our algorithms are consistent (performing nearly as fast as the offline algorithm) when predictions are nearly perfect, have a smooth degradation in performance with respect to the prediction error and, in the worst case, match the best offline algorithm up to logarithmic factors. As a building block, we study the offline incremental approximate single-source shortest-paths problem. In this problem, the edge sequence $σ$ is known a priori and the goal is to efficiently return the length of the shortest paths in the intermediate graph $G_t$ consisting of the first $t$ edges, for all $t$. Note that the offline incremental problem is defined in the worst-case setting (without predictions) and is of independent interest.

DSMay 17, 2023
Online List Labeling with Predictions

Samuel McCauley, Benjamin Moseley, Aidin Niaparast et al.

A growing line of work shows how learned predictions can be used to break through worst-case barriers to improve the running time of an algorithm. However, incorporating predictions into data structures with strong theoretical guarantees remains underdeveloped. This paper takes a step in this direction by showing that predictions can be leveraged in the fundamental online list labeling problem. In the problem, n items arrive over time and must be stored in sorted order in an array of size Theta(n). The array slot of an element is its label and the goal is to maintain sorted order while minimizing the total number of elements moved (i.e., relabeled). We design a new list labeling data structure and bound its performance in two models. In the worst-case learning-augmented model, we give guarantees in terms of the error in the predictions. Our data structure provides strong guarantees: it is optimal for any prediction error and guarantees the best-known worst-case bound even when the predictions are entirely erroneous. We also consider a stochastic error model and bound the performance in terms of the expectation and variance of the error. Finally, the theoretical results are demonstrated empirically. In particular, we show that our data structure has strong performance on real temporal data sets where predictions are constructed from elements that arrived in the past, as is typically done in a practical use case.