Jakub Tětek

DS
5papers
39citations
Novelty68%
AI Score48

5 Papers

25.1DSJun 1
SplineSketch: Even More Accurate Quantiles with Error Guarantees

Aleksander Łukasiewicz, Jakub Tětek, Pavel Veselý

Space-efficient streaming estimation of quantiles in massive datasets is a fundamental problem with numerous applications in data monitoring and analysis. While theoretical research led to optimal algorithms, such as the Greenwald-Khanna algorithm or the KLL sketch, practitioners often use other sketches that perform significantly better in practice but lack theoretical guarantees. Most notably, the widely used $t$-digest has unbounded worst-case error. In this paper, we seek to get the best of both worlds. We present a new quantile summary, SplineSketch, for numeric data, offering near-optimal theoretical guarantees, namely uniformly bounded rank error, and outperforming $t$-digest by a factor of 2-20 on a range of synthetic and real-world datasets. To achieve such performance, we develop a novel approach that maintains a dynamic subdivision of the input range into buckets while fitting the input distribution using monotone cubic spline interpolation. The core challenge is implementing this method in a space-efficient manner while ensuring strong worst-case guarantees.

DSJul 16, 2022
A Nearly Tight Analysis of Greedy k-means++

Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň et al.

The famous $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the $k$-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random and each of the following $k-1$ centers is then always sampled proportional to its squared distance to the closest center so far. Afterward, Lloyd's iterative algorithm is run. The $k$-means++ algorithm is known to return a $Θ(\log k)$ approximate solution in expectation. In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees for its following \emph{greedy} variant: in every step, we sample $\ell$ candidate centers instead of one and then pick the one that minimizes the new cost. This is also how $k$-means++ is implemented in e.g. the popular Scikit-learn library [Pedregosa et al.; JMLR 2011]. We present nearly matching lower and upper bounds for the greedy $k$-means++: We prove that it is an $O(\ell^3 \log^3 k)$-approximation algorithm. On the other hand, we prove a lower bound of $Ω(\ell^3 \log^3 k / \log^2(\ell\log k))$. Previously, only an $Ω(\ell \log k)$ lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020] and there was no known upper bound.

20.0DSMay 5
Fast and Simple Sorting Using Partial Information

Bernhard Haeupler, Richard Hladík, John Iacono et al.

We consider the problem of sorting $n$ items, given the outcomes of $m$ pre-existing comparisons. We present a simple and natural deterministic algorithm that runs in $O(m + \log T)$ time and does $O(\log T)$ comparisons, where $T$ is the number of total orders consistent with the pre-existing comparisons. Our running time and comparison bounds are best possible up to constant factors, thus resolving a problem that has been studied intensely since 1976 (Fredman, Theoretical Computer Science). The best previous algorithm with a bound of $O(\log T)$ on the number of comparisons has a time bound of $O(n^{2.5})$ and is more complicated. Our algorithm combines three classic algorithms: topological sort, heapsort with the right kind of heap, and efficient search in a sorted list. It outputs the items in sorted order one by one. It can be modified to stop early, thereby solving the important and more general top-$k$ sorting problem: Given $k$ and the outcomes of some pre-existing comparisons, output the smallest $k$ items in sorted order. The modified algorithm solves the top-$k$ sorting problem in minimum time and comparisons, to within constant factors.

DSFeb 3, 2021
CountSketches, Feature Hashing and the Median of Three

Kasper Green Larsen, Rasmus Pagh, Jakub Tětek

In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1) s$, where $t, s > 0$ are integer parameters. It is known that even for $t=1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2 \|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting. Our main contribution is a new analysis of Count-Sketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$. We also study the variance in the setting where an inner product is to be estimated from two CountSketches. This finding suggests that the Feature Hashing method, which is essentially identical to CountSketch but does not make use of the median estimator, can be made more reliable at a small cost in settings where using a median estimator is possible. We confirm our theoretical findings in experiments and thereby help justify why a small constant number of estimates often suffice in practice. Our improved variance bounds are based on new general theorems about the variance and higher moments of the median of i.i.d. random variables that may be of independent interest.

AINov 12, 2020
Performance of Bounded-Rational Agents With the Ability to Self-Modify

Jakub Tětek, Marek Sklenka, Tomáš Gavenčiak

Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals. Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result is no longer true for agents with bounded rationality. In such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (1-4 below). We also discuss model assumptions and the wider problem and framing space. We examine four ways in which an agent can be bounded-rational: it either (1) doesn't always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that while in the cases (2)-(4) the misalignment caused by the agent's imperfection does not increase over time, with (1) the misalignment may grow exponentially.