LGJun 9, 2022
Unveiling Transformers with LEGO: a synthetic reasoning taskYi Zhang, Arturs Backurs, Sébastien Bubeck et al. · microsoft-research
We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to understand some of the attention heads as well as how the information flows in the network. In particular, we have identified a novel \emph{association} pattern that globally attends only to identical tokens. Based on these observations we propose a hypothesis that here pretraining helps for LEGO tasks due to certain structured attention patterns, and we experimentally verify this hypothesis. We also observe that in some data regime the trained transformer finds ``shortcut" solutions to follow the chain of reasoning, which impedes the model's robustness, and moreover we propose ways to prevent it. Motivated by our findings on structured attention patterns, we propose the LEGO attention module, a drop-in replacement for vanilla attention heads. This architectural change significantly reduces Flops and maintains or even \emph{improves} the model's performance at large-scale pretraining.
LGOct 24, 2022
Budget-Constrained Bounds for Mini-Batch Estimation of Optimal TransportDavid Alvarez-Melis, Nicolò Fusi, Lester Mackey et al. · harvard, microsoft-research
Optimal Transport (OT) is a fundamental tool for comparing probability distributions, but its exact computation remains prohibitive for large datasets. In this work, we introduce novel families of upper and lower bounds for the OT problem constructed by aggregating solutions of mini-batch OT problems. The upper bound family contains traditional mini-batch averaging at one extreme and a tight bound found by optimal coupling of mini-batches at the other. In between these extremes, we propose various methods to construct bounds based on a fixed computational budget. Through various experiments, we explore the trade-off between computational budget and bound tightness and show the usefulness of these bounds in computer vision applications.
DSMar 17, 2022
Triangle and Four Cycle Counting with Predictions in Graph StreamsJustin Y. Chen, Talya Eden, Piotr Indyk et al.
We propose data-driven one-pass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, (Hsu 2018) and (Jiang 2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior "classical" algorithms that did not use oracles. In this paper, we explore the power of a "heavy edge" oracle in multiple graph edge streaming models. In the adjacency list model, we present a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on "classical" streaming algorithms, as previous multi-pass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to state-of-the-art streaming algorithms.
LGJun 16, 2022
Generalization Bounds for Data-Driven Numerical Linear AlgebraPeter Bartlett, Piotr Indyk, Tal Wagner
Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main results are closely matching upper and lower bounds on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al.~(NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.
LGNov 6, 2022
Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural NetworksAnders Aamand, Justin Y. Chen, Piotr Indyk et al.
Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test can be simulated by GNNs. However, those simulations involve neural networks for the 'combine' function of size polynomial or even exponential in the number of graph nodes $n$, as well as feature vectors of length linear in $n$. We present an improved simulation of the WL test on GNNs with \emph{exponentially} lower complexity. In particular, the neural network implementing the combine function in each node has only a polylogarithmic number of parameters in $n$, and the feature vectors exchanged by the nodes of GNN consists of only $O(\log n)$ bits. We also give logarithmic lower bounds for the feature vector length and the size of the neural networks, showing the (near)-optimality of our construction.
DSApr 15, 2023
Learned Interpolation for Better Streaming Quantile Approximation with Worst-Case GuaranteesNicholas Schiefer, Justin Y. Chen, Piotr Indyk et al.
An $\varepsilon$-approximate quantile sketch over a stream of $n$ inputs approximates the rank of any query point $q$ - that is, the number of input points less than $q$ - up to an additive error of $\varepsilon n$, generally with some probability of at least $1 - 1/\mathrm{poly}(n)$, while consuming $o(n)$ space. While the celebrated KLL sketch of Karnin, Lang, and Liberty achieves a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem to attempt to achieve better approximations on real-world data sets than KLL while maintaining similar guarantees in the worst case.
DSJul 4, 2023
Fast Private Kernel Density Estimation via Locality Sensitive QuantizationTal Wagner, Yonatan Naamad, Nina Mishra
We study efficient mechanisms for differentially private kernel density estimation (DP-KDE). Prior work for the Gaussian kernel described algorithms that run in time exponential in the number of dimensions $d$. This paper breaks the exponential barrier, and shows how the KDE can privately be approximated in time linear in $d$, making it feasible for high-dimensional data. We also present improved bounds for low-dimensional data. Our results are obtained through a general framework, which we term Locality Sensitive Quantization (LSQ), for constructing private KDE mechanisms where existing KDE approximation techniques can be applied. It lets us leverage several efficient non-private KDE methods -- like Random Fourier Features, the Fast Gauss Transform, and Locality Sensitive Hashing -- and ``privatize'' them in a black-box manner. Our experiments demonstrate that our resulting DP-KDE mechanisms are fast and accurate on large datasets in both high and low dimensions.
70.2DSMay 2
New Bounds for Kernel Sums via Fast Spherical EmbeddingsTal Wagner
We study query time bounds for the fundamental problem of estimating the kernel mean $\frac1{|X|}\sum_{x\in X}\mathbf{k}(x,y)$ of a query $y$ in a finite dataset $X\subset\mathbb{R}^d$ up to a prescribed additive error $\varepsilon$. The best known bounds for the Gaussian kernel are $O(d/\varepsilon^2)$, $\widetilde O(d+1/\varepsilon^4)$, and $\widetilde O(d+Δ^2/\varepsilon^2)$, where $Δ$ is the diameter of a region containing the points. We prove the new bound $\tilde O(d+\varepsilonΔ^2+1/\varepsilon^3)$, which improves over the previous ones in regimes with small error $\varepsilon$ and intermediate diameter $Δ$. At the center of our proof is a new fast spherical embedding theorem in the sense introduced by Bartal, Recht and Schulman (2011), which limits the embedded data diameter while preserving local Euclidean distances and avoiding ``distance collapse'' at larger scales. This fast embedding theorem may be of independent interest.
DSDec 19, 2025
Graph-based Nearest Neighbors with Dynamic Updates via Random WalksNina Mishra, Yonatan Naamad, Tal Wagner et al.
Approximate nearest neighbor search (ANN) is a common way to retrieve relevant search results, especially now in the context of large language models and retrieval augmented generation. One of the most widely used algorithms for ANN is based on constructing a multi-layer graph over the dataset, called the Hierarchical Navigable Small World (HNSW). While this algorithm supports insertion of new data, it does not support deletion of existing data. Moreover, deletion algorithms described by prior work come at the cost of increased query latency, decreased recall, or prolonged deletion time. In this paper, we propose a new theoretical framework for graph-based ANN based on random walks. We then utilize this framework to analyze a randomized deletion approach that preserves hitting time statistics compared to the graph before deleting the point. We then turn this theoretical framework into a deterministic deletion algorithm, and show that it provides better tradeoff between query latency, recall, deletion time, and memory usage through an extensive collection of experiments.
LGNov 2, 2025
SpEx: A Spectral Approach to Explainable ClusteringTal Argov, Tal Wagner
Explainable clustering by axis-aligned decision trees was introduced by Moshkovitz et al. (2020) and has gained considerable interest. Prior work has focused on minimizing the price of explainability for specific clustering objectives, lacking a general method to fit an explanation tree to any given clustering, without restrictions. In this work, we propose a new and generic approach to explainable clustering, based on spectral graph partitioning. With it, we design an explainable clustering algorithm that can fit an explanation tree to any given non-explainable clustering, or directly to the dataset itself. Moreover, we show that prior algorithms can also be interpreted as graph partitioning, through a generalized framework due to Trevisan (2013) wherein cuts are optimized in two graphs simultaneously. Our experiments show the favorable performance of our method compared to baselines on a range of datasets.
33.3LGMay 10
Positional LSH: Binary Block Matrix Approximation for Attention with Linear BiasesDaniel Wolfson, Tal Wagner
Positional encoding in transformers is commonly implemented through positional embeddings, attention masks, or bias terms, but formal connections between these mechanisms remain limited. We study attention with positional bias through the lens of locality-sensitive hashing (LSH), focusing on Attention with Linear Biases (ALiBi). We show that the ALiBi bias matrix is the expectation of contiguous block-diagonal binary masks induced by a ``positional LSH'' scheme. The empirical mean of masks sampled from this scheme yields spectral norm and max-norm approximation guarantees with bounded block sizes with high probability. This structural theorem implies a uniform approximation theorem for ALiBi-biased attention: with high probability over the sampled masks, the approximate attention output is accurate simultaneously for all query-key-value inputs and can be computed in near-linear time in the context length, reducing long-context ALiBi to a collection of randomized short-context regular (positionally unbiased) attention operations. Conceptually, this connects positional bias, masks, and positional embeddings in a single formal framework and suggests an approach to efficient ALiBi-biased attention. Experiments on large language models validate our theoretical findings.
CLFeb 18, 2025
Private Text Generation by Seeding Large Language Model PromptsSupriya Nagesh, Justin Y. Chen, Nina Mishra et al.
We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus while achieving requisite output diversity and maintaining differential privacy. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones. Our findings offer hope that institutions can reap ML insights by privately sharing data with simple prompts and little compute.
LGJul 31, 2025
Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity AssumptionsPiotr Indyk, Michael Kapralov, Kshiteej Sheth et al.
Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian Kernel matrices $K\in \mathbb{R}^{n\times n}$. $K$'s columns are indexed by a set of $n$ keys $k_1,k_2\ldots, k_n\in \mathbb{R}^d$, rows by a set of $n$ queries $q_1,q_2,\ldots,q_n\in \mathbb{R}^d $, and its $i,j$ entry is $K_{ij} = e^{-\|q_i-k_j\|_2^2/2σ^2}$ for some bandwidth parameter $σ>0$. Given a vector $x\in \mathbb{R}^n$ and error parameter $ε>0$, our task is to output a $y\in \mathbb{R}^n$ such that $\|Kx-y\|_2\leq ε\|x\|_2$ in time subquadratic in $n$ and linear in $d$. Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$, as opposed to worst case quadratic growth. We validate this assumption experimentally, for Gaussian kernel matrices encountered in various settings such as fast attention computation in LLMs. We obtain the first subquadratic-time algorithm that works under this assumption, for unrestricted vectors.
LGFeb 19, 2025
Learning from End User Data with Shuffled Differential Privacy over Kernel DensitiesTal Wagner
We study a setting of collecting and learning from private data distributed across end users. In the shuffled model of differential privacy, the end users partially protect their data locally before sharing it, and their data is also anonymized during its collection to enhance privacy. This model has recently become a prominent alternative to central DP, which requires full trust in a central data curator, and local DP, where fully local data protection takes a steep toll on downstream accuracy. Our main technical result is a shuffled DP protocol for privately estimating the kernel density function of a distributed dataset, with accuracy essentially matching central DP. We use it to privately learn a classifier from the end user data, by learning a private density function per class. Moreover, we show that the density function itself can recover the semantic content of its class, despite having been learned in the absence of any unprotected data. Our experiments show the favorable downstream performance of our approach, and highlight key downstream considerations and trade-offs in a practical ML deployment of shuffled DP.
LGJun 15, 2021
Learning-based Support Estimation in Sublinear TimeTalya Eden, Piotr Indyk, Shyam Narayanan et al.
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \ \log (1/\varepsilon) \cdot n^{1-Θ(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.
DSFeb 16, 2021
Faster Kernel Matrix Algebra via Density EstimationArturs Backurs, Piotr Indyk, Cameron Musco et al.
We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points $x_1,\ldots,x_n \in \mathbb{R}^d$. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to $1+ε$ relative error in time $sublinear$ in $n$ and linear in $d$ for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to $1+ε$ relative error in time $subquadratic$ in $n$ and linear in $d$. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.
DSJun 2, 2019
Sample-Optimal Low-Rank Approximation of Distance MatricesPiotr Indyk, Ali Vakilian, Tal Wagner et al.
A distance matrix $A \in \mathbb R^{n \times m}$ represents all pairwise distances, $A_{ij}=\mathrm{d}(x_i,y_j)$, between two point sets $x_1,...,x_n$ and $y_1,...,y_m$ in an arbitrary metric space $(\mathcal Z, \mathrm{d})$. Such matrices arise in various computational contexts such as learning image manifolds, handwriting recognition, and multi-dimensional unfolding. In this work we study algorithms for low-rank approximation of distance matrices. Recent work by Bakshi and Woodruff (NeurIPS 2018) showed it is possible to compute a rank-$k$ approximation of a distance matrix in time $O((n+m)^{1+γ}) \cdot \mathrm{poly}(k,1/ε)$, where $ε>0$ is an error parameter and $γ>0$ is an arbitrarily small constant. Notably, their bound is sublinear in the matrix size, which is unachievable for general matrices. We present an algorithm that is both simpler and more efficient. It reads only $O((n+m) k/ε)$ entries of the input matrix, and has a running time of $O(n+m) \cdot \mathrm{poly}(k,1/ε)$. We complement the sample complexity of our algorithm with a matching lower bound on the number of entries that must be read by any algorithm. We provide experimental results to validate the approximation quality and running time of our algorithm.
DSFeb 10, 2019
Scalable Fair ClusteringArturs Backurs, Piotr Indyk, Krzysztof Onak et al.
We study the fair variant of the classic $k$-median problem introduced by Chierichetti et al. [2017]. In the standard $k$-median problem, given an input pointset $P$, the goal is to find $k$ centers $C$ and assign each input point to one of the centers in $C$ such that the average distance of points to their cluster center is minimized. In the fair variant of $k$-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an "approximately equal" number of points of each color. Chierichetti et al. proposed a two-phase algorithm for fair $k$-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the $k$-median objective. In the second step, fairlets are merged into $k$ clusters by one of the existing $k$-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time. In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. Our algorithm additionally allows for finer control over the balance of resulting clusters than the original work. We complement our theoretical bounds with empirical evaluation.
LGJan 24, 2019
Learning Space Partitions for Nearest Neighbor SearchYihe Dong, Piotr Indyk, Ilya Razenshteyn et al.
Space partitions of $\mathbb{R}^d$ underlie a vast and important class of fast nearest neighbor search (NNS) algorithms. Inspired by recent theoretical work on NNS for general metric spaces [Andoni, Naor, Nikolov, Razenshteyn, Waingarten STOC 2018, FOCS 2018], we develop a new framework for building space partitions reducing the problem to balanced graph partitioning followed by supervised classification. We instantiate this general approach with the KaHIP graph partitioner [Sanders, Schulz SEA 2013] and neural networks, respectively, to obtain a new partitioning procedure called Neural Locality-Sensitive Hashing (Neural LSH). On several standard benchmarks for NNS, our experiments show that the partitions obtained by Neural LSH consistently outperform partitions found by quantization-based and tree-based methods as well as classic, data-oblivious LSH.