LGMar 10, 2023Code
Exphormer: Sparse Transformers for GraphsHamed Shirzad, Ameya Velingker, Balaji Venkatachalam et al.
Graph transformers have emerged as a promising architecture for a variety of graph learning and representation tasks. Despite their successes, though, it remains challenging to scale graph transformers to large graphs while maintaining accuracy competitive with message-passing networks. In this paper, we introduce Exphormer, a framework for building powerful and scalable graph transformers. Exphormer consists of a sparse attention mechanism based on two mechanisms: virtual global nodes and expander graphs, whose mathematical characteristics, such as spectral expansion, pseduorandomness, and sparsity, yield graph transformers with complexity only linear in the size of the graph, while allowing us to prove desirable theoretical properties of the resulting transformer models. We show that incorporating Exphormer into the recently-proposed GraphGPS framework produces models with competitive empirical results on a wide variety of graph datasets, including state-of-the-art results on three datasets. We also show that Exphormer can scale to datasets on larger graphs than shown in previous graph transformer architectures. Code can be found at \url{https://github.com/hamed1375/Exphormer}.
CCApr 2
Linear Space Streaming Lower Bounds for Approximating CSPsChi-Ning Chou, Alexander Golovnev, Madhu Sudan et al.
We consider the approximability of constraint satisfaction problems in the streaming setting. For every constraint satisfaction problem (CSP) on $n$ variables taking values in $\{0,\ldots,q-1\}$, we prove that improving over the trivial approximability by a factor of $q$ requires $Ω(n)$ space even on instances with $O(n)$ constraints. We also identify a broad subclass of problems for which any improvement over the trivial approximability requires $Ω(n)$ space. The key technical core is an optimal, $q^{-(k-1)}$-inapproximability for the Max $k$-LIN-$\bmod\; q$ problem, which is the Max CSP problem where every constraint is given by a system of $k-1$ linear equations $\bmod\; q$ over $k$ variables. Our work builds on and extends the breakthrough work of Kapralov and Krachun (Proc. STOC 2019) who showed a linear lower bound on any non-trivial approximation of the MaxCut problem in graphs. MaxCut corresponds roughly to the case of Max $k$-LIN-$\bmod\; q$ with ${k=q=2}$. For general CSPs in the streaming setting, prior results only yielded $Ω(\sqrt{n})$ space bounds. In particular no linear space lower bound was known for an approximation factor less than $1/2$ for any CSP. Extending the work of Kapralov and Krachun to Max $k$-LIN-$\bmod\; q$ to $k>2$ and $q>2$ (while getting optimal hardness results) is the main technical contribution of this work. Each one of these extensions provides non-trivial technical challenges that we overcome in this work.
DSJun 2, 2023
Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix FactorizationAmeya Velingker, Maximilian Vötsch, David P. Woodruff et al.
We introduce efficient $(1+\varepsilon)$-approximation algorithms for the binary matrix factorization (BMF) problem, where the inputs are a matrix $\mathbf{A}\in\{0,1\}^{n\times d}$, a rank parameter $k>0$, as well as an accuracy parameter $\varepsilon>0$, and the goal is to approximate $\mathbf{A}$ as a product of low-rank factors $\mathbf{U}\in\{0,1\}^{n\times k}$ and $\mathbf{V}\in\{0,1\}^{k\times d}$. Equivalently, we want to find $\mathbf{U}$ and $\mathbf{V}$ that minimize the Frobenius loss $\|\mathbf{U}\mathbf{V} - \mathbf{A}\|_F^2$. Before this work, the state-of-the-art for this problem was the approximation algorithm of Kumar et. al. [ICML 2019], which achieves a $C$-approximation for some constant $C\ge 576$. We give the first $(1+\varepsilon)$-approximation algorithm using running time singly exponential in $k$, where $k$ is typically a small integer. Our techniques generalize to other common variants of the BMF problem, admitting bicriteria $(1+\varepsilon)$-approximation algorithms for $L_p$ loss functions and the setting where matrix operations are performed in $\mathbb{F}_2$. Our approach can be implemented in standard big data models, such as the streaming or distributed models.
LGOct 2, 2023
Locality-Aware Graph-Rewiring in GNNsFederico Barbero, Ameya Velingker, Amin Saberi et al.
Graph Neural Networks (GNNs) are popular models for machine learning on graphs that typically follow the message-passing paradigm, whereby the feature of a node is updated recursively upon aggregating information over its neighbors. While exchanging messages over the input graph endows GNNs with a strong inductive bias, it can also make GNNs susceptible to over-squashing, thereby preventing them from capturing long-range interactions in the given graph. To rectify this issue, graph rewiring techniques have been proposed as a means of improving information flow by altering the graph connectivity. In this work, we identify three desiderata for graph-rewiring: (i) reduce over-squashing, (ii) respect the locality of the graph, and (iii) preserve the sparsity of the graph. We highlight fundamental trade-offs that occur between spatial and spectral rewiring techniques; while the former often satisfy (i) and (ii) but not (iii), the latter generally satisfy (i) and (iii) at the expense of (ii). We propose a novel rewiring framework that satisfies all of (i)--(iii) through a locality-aware sequence of rewiring operations. We then discuss a specific instance of such rewiring framework and validate its effectiveness on several real-world benchmarks, showing that it either matches or significantly outperforms existing rewiring approaches.
LGJun 23, 2022
Affinity-Aware Graph NetworksAmeya Velingker, Ali Kemal Sinop, Ira Ktena et al.
Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform -- and hence a smaller receptive field -- there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has lower computational complexity, while our features are invariant to the permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks for a wide variety of tasks, often with significantly fewer message passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.
LGFeb 12, 2024
Weisfeiler-Leman at the margin: When more expressivity mattersBilly J. Franks, Christopher Morris, Ameya Velingker et al.
The Weisfeiler-Leman algorithm ($1$-WL) is a well-studied heuristic for the graph isomorphism problem. Recently, the algorithm has played a prominent role in understanding the expressive power of message-passing graph neural networks (MPNNs) and being effective as a graph kernel. Despite its success, $1$-WL faces challenges in distinguishing non-isomorphic graphs, leading to the development of more expressive MPNN and kernel architectures. However, the relationship between enhanced expressivity and improved generalization performance remains unclear. Here, we show that an architecture's expressivity offers limited insights into its generalization performance when viewed through graph isomorphism. Moreover, we focus on augmenting $1$-WL and MPNNs with subgraph information and employ classical margin theory to investigate the conditions under which an architecture's increased expressivity aligns with improved generalization performance. In addition, we show that gradient flow pushes the MPNN's weights toward the maximum margin solution. Further, we introduce variations of expressive $1$-WL-based kernel and MPNN architectures with provable generalization properties. Our empirical study confirms the validity of our theoretical findings.
LGNov 25, 2024
Even Sparser Graph TransformersHamed Shirzad, Honghao Lin, Balaji Venkatachalam et al.
Graph Transformers excel in long-range dependency modeling, but generally require quadratic memory complexity in the number of nodes in an input graph, and hence have trouble scaling to large graphs. Sparse attention variants such as Exphormer can help, but may require high-degree augmentations to the input graph for good performance, and do not attempt to sparsify an already-dense input graph. As the learned attention mechanisms tend to use few of these edges, such high-degree connections may be unnecessary. We show (empirically and with theoretical backing) that attention scores on graphs are usually quite consistent across network widths, and use this observation to propose a two-stage procedure, which we call Spexphormer: first, train a narrow network on the full augmented graph. Next, use only the active connections to train a wider network on a much sparser graph. We establish theoretical conditions when a narrow network's attention scores can match those of a wide network, and show that Spexphormer achieves good performance with drastically reduced memory requirements on various graph datasets.
LGNov 20, 2024
A Theory for Compressibility of Graph Transformers for Transductive LearningHamed Shirzad, Honghao Lin, Ameya Velingker et al.
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks, as the independent and identically distributed (i.i.d.) assumption does not hold among samples. Instead, all train/test/validation samples are present during training, making them more akin to a semi-supervised task. These differences make the analysis of the models substantially different from other models. Recently, Graph Transformers have significantly improved results on these datasets by overcoming long-range dependency problems. However, the quadratic complexity of full Transformers has driven the community to explore more efficient variants, such as those with sparser attention patterns. While the attention matrix has been extensively discussed, the hidden dimension or width of the network has received less attention. In this work, we establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed. Our results apply to both sparse and dense variants of Graph Transformers.
MLDec 7, 2021
Private Robust Estimation by Stabilizing Convex RelaxationsPravesh K. Kothari, Pasin Manurangsi, Ameya Velingker
We give the first polynomial time and sample $(ε, δ)$-differentially private (DP) algorithm to estimate the mean, covariance and higher moments in the presence of a constant fraction of adversarial outliers. Our algorithm succeeds for families of distributions that satisfy two well-studied properties in prior works on robust estimation: certifiable subgaussianity of directional moments and certifiable hypercontractivity of degree 2 polynomials. Our recovery guarantees hold in the "right affine-invariant norms": Mahalanobis distance for mean, multiplicative spectral and relative Frobenius distance guarantees for covariance and injective norms for higher moments. Prior works obtained private robust algorithms for mean estimation of subgaussian distributions with bounded covariance. For covariance estimation, ours is the first efficient algorithm (even in the absence of outliers) that succeeds without any condition-number assumptions. Our algorithms arise from a new framework that provides a general blueprint for modifying convex relaxations for robust estimation to satisfy strong worst-case stability guarantees in the appropriate parameter norms whenever the algorithms produce witnesses of correctness in their run. We verify such guarantees for a modification of standard sum-of-squares (SoS) semidefinite programming relaxations for robust estimation. Our privacy guarantees are obtained by combining stability guarantees with a new "estimate dependent" noise injection mechanism in which noise scales with the eigenvalues of the estimated covariance. We believe this framework will be useful more generally in obtaining DP counterparts of robust estimators. Independently of our work, Ashtiani and Liaw [AL21] also obtained a polynomial time and sample private robust estimation algorithm for Gaussian distributions.
MLMar 21, 2020
Scaling up Kernel Ridge Regression via Locality Sensitive HashingMichael Kapralov, Navid Nouri, Ilya Razenshteyn et al.
Random binning features, introduced in the seminal paper of Rahimi and Recht (2007), are an efficient method for approximating a kernel matrix using locality sensitive hashing. Random binning features provide a very simple and efficient way of approximating the Laplace kernel but unfortunately do not apply to many important classes of kernels, notably ones that generate smooth Gaussian processes, such as the Gaussian kernel and Matern kernel. In this paper, we introduce a simple weighted version of random binning features and show that the corresponding kernel function generates Gaussian processes of any desired smoothness. We show that our weighted random binning features provide a spectral approximation to the corresponding kernel matrix, leading to efficient algorithms for kernel ridge regression. Experiments on large scale regression datasets show that our method outperforms the accuracy of random Fourier features method.
CRFeb 5, 2020
Pure Differentially Private Summation from Anonymous MessagesBadih Ghazi, Noah Golowich, Ravi Kumar et al.
The shuffled (aka anonymous) model has recently generated significant interest as a candidate distributed privacy framework with trust assumptions better than the central model but with achievable errors smaller than the local model. We study pure differentially private (DP) protocols in the shuffled model for summation, a basic and widely used primitive: - For binary summation where each of n users holds a bit as an input, we give a pure $ε$-DP protocol for estimating the number of ones held by the users up to an error of $O_ε(1)$, and each user sends $O_ε(\log n)$ messages each of 1 bit. This is the first pure protocol in the shuffled model with error $o(\sqrt{n})$ for constant $ε$. Using this protocol, we give a pure $ε$-DP protocol that performs summation of real numbers in $[0, 1]$ up to an error of $O_ε(1)$, and where each user sends $O_ε(\log^3 n)$ messages each of $O(\log\log n)$ bits. - In contrast, we show that for any pure $ε$-DP protocol for binary summation in the shuffled model having absolute error $n^{0.5-Ω(1)}$, the per user communication has to be at least $Ω_ε(\sqrt{\log n})$ bits. This implies the first separation between the (bounded-communication) multi-message shuffled model and the central model, and the first separation between pure and approximate DP protocols in the shuffled model. To prove our lower bound, we consider (a generalization of) the following question: given $γ$ in $(0, 1)$, what is the smallest m for which there are two random variables $X^0, X^1$ supported on $\{0, \dots ,m\}$ such that (i) the total variation distance between $X^0$ and $X^1$ is at least $1-γ$, and (ii) the moment generating functions of $X^0$ and $X^1$ are within a constant factor of each other everywhere? We show that the answer is $m = Θ(\sqrt{\log(1/γ)})$.
CRSep 24, 2019
Private Aggregation from Fewer Anonymous MessagesBadih Ghazi, Pasin Manurangsi, Rasmus Pagh et al.
Consider the setup where $n$ parties are each given a number $x_i \in \mathbb{F}_q$ and the goal is to compute the sum $\sum_i x_i$ in a secure fashion and with as little communication as possible. We study this problem in the anonymized model of Ishai et al. (FOCS 2006) where each party may broadcast anonymous messages on an insecure channel. We present a new analysis of the one-round "split and mix" protocol of Ishai et al. In order to achieve the same security parameter, our analysis reduces the required number of messages by a $Θ(\log n)$ multiplicative factor. We complement our positive result with lower bounds showing that the dependence of the number of messages on the domain size, the number of parties, and the security parameter is essentially tight. Using a reduction of Balle et al. (2019), our improved analysis of the protocol of Ishai et al. yields, in the same model, an $\left(\varepsilon, δ\right)$-differentially private protocol for aggregation that, for any constant $\varepsilon > 0$ and any $δ= \frac{1}{\mathrm{poly}(n)}$, incurs only a constant error and requires only a constant number of messages per party. Previously, such a protocol was known only for $Ω(\log n)$ messages per party.
CRAug 29, 2019
On the Power of Multiple Anonymous MessagesBadih Ghazi, Noah Golowich, Ravi Kumar et al.
An exciting new development in differential privacy is the shuffled model, in which an anonymous channel enables non-interactive, differentially private protocols with error much smaller than what is possible in the local model, while relying on weaker trust assumptions than in the central model. In this paper, we study basic counting problems in the shuffled model and establish separations between the error that can be achieved in the single-message shuffled model and in the shuffled model with multiple messages per user. For the problem of frequency estimation for $n$ users and a domain of size $B$, we obtain: - A nearly tight lower bound of $\tildeΩ( \min(\sqrt[4]{n}, \sqrt{B}))$ on the error in the single-message shuffled model. This implies that the protocols obtained from the amplification via shuffling work of Erlingsson et al. (SODA 2019) and Balle et al. (Crypto 2019) are essentially optimal for single-message protocols. A key ingredient in the proof is a lower bound on the error of locally-private frequency estimation in the low-privacy (aka high $ε$) regime. - Protocols in the multi-message shuffled model with $poly(\log{B}, \log{n})$ bits of communication per user and $poly\log{B}$ error, which provide an exponential improvement on the error compared to what is possible with single-message algorithms. For the related selection problem on a domain of size $B$, we prove: - A nearly tight lower bound of $Ω(B)$ on the number of users in the single-message shuffled model. This significantly improves on the $Ω(B^{1/17})$ lower bound obtained by Cheu et al. (Eurocrypt 2019), and when combined with their $\tilde{O}(\sqrt{B})$-error multi-message protocol, implies the first separation between single-message and multi-message protocols for this problem.
LGJun 19, 2019
Scalable and Differentially Private Distributed Aggregation in the Shuffled ModelBadih Ghazi, Rasmus Pagh, Ameya Velingker
Federated learning promises to make machine learning feasible on distributed, private datasets by implementing gradient descent using secure aggregation methods. The idea is to compute a global weight update without revealing the contributions of individual users. Current practical protocols for secure aggregation work in an "honest but curious" setting where a curious adversary observing all communication to and from the server cannot learn any private information assuming the server is honest and follows the protocol. A more scalable and robust primitive for privacy-preserving protocols is shuffling of user data, so as to hide the origin of each data item. Highly scalable and secure protocols for shuffling, so-called mixnets, have been proposed as a primitive for privacy-preserving analytics in the Encode-Shuffle-Analyze framework by Bittau et al., which was later analytically studied by Erlingsson et al. and Cheu et al.. The recent papers by Cheu et al., and Balle et al. have given protocols for secure aggregation that achieve differential privacy guarantees in this "shuffled model". Their protocols come at a cost, though: Either the expected aggregation error or the amount of communication per user scales as a polynomial $n^{Ω(1)}$ in the number of users $n$. In this paper we propose simple and more efficient protocol for aggregation in the shuffled model, where communication as well as error increases only polylogarithmically in $n$. Our new technique is a conceptual "invisibility cloak" that makes users' data almost indistinguishable from random noise while introducing zero distortion on the sum.
DSDec 20, 2018
A Universal Sampling Method for Reconstructing Signals with Simple Fourier TransformsHaim Avron, Michael Kapralov, Cameron Musco et al.
Reconstructing continuous signals from a small number of discrete samples is a fundamental problem across science and engineering. In practice, we are often interested in signals with 'simple' Fourier structure, such as bandlimited, multiband, and Fourier sparse signals. More broadly, any prior knowledge about a signal's Fourier power spectrum can constrain its complexity. Intuitively, signals with more highly constrained Fourier structure require fewer samples to reconstruct. We formalize this intuition by showing that, roughly, a continuous signal from a given class can be approximately reconstructed using a number of samples proportional to the *statistical dimension* of the allowed power spectrum of that class. Further, in nearly all settings, this natural measure tightly characterizes the sample complexity of signal reconstruction. Surprisingly, we also show that, up to logarithmic factors, a universal non-uniform sampling strategy can achieve this optimal complexity for *any class of signals*. We present a simple and efficient algorithm for recovering a signal from the samples taken. For bandlimited and sparse signals, our method matches the state-of-the-art. At the same time, it gives the first computationally and sample efficient solution to a broad range of problems, including multiband signal reconstruction and kriging and Gaussian process regression tasks in one dimension. Our work is based on a novel connection between randomized linear algebra and signal reconstruction with constrained Fourier structure. We extend tools based on statistical leverage score sampling and column-based matrix reconstruction to the approximation of continuous linear operators that arise in signal reconstruction. We believe that these extensions are of independent interest and serve as a foundation for tackling a broad range of continuous time problems using randomized methods.
LGApr 26, 2018
Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical GuaranteesHaim Avron, Michael Kapralov, Cameron Musco et al.
Random Fourier features is one of the most popular techniques for scaling up kernel methods, such as kernel ridge regression. However, despite impressive empirical results, the statistical properties of random Fourier features are still not well understood. In this paper we take steps toward filling this gap. Specifically, we approach random Fourier features from a spectral matrix approximation point of view, give tight bounds on the number of Fourier features required to achieve a spectral approximation, and show how spectral matrix approximation bounds imply statistical guarantees for kernel ridge regression. Qualitatively, our results are twofold: on the one hand, we show that random Fourier feature approximation can provably speed up kernel ridge regression under reasonable assumptions. At the same time, we show that the method is suboptimal, and sampling from a modified distribution in Fourier space, given by the leverage function of the kernel, yields provably better performance. We study this optimal sampling distribution for the Gaussian kernel, achieving a nearly complete characterization for the case of low-dimensional bounded datasets. Based on this characterization, we propose an efficient sampling scheme with guarantees superior to random Fourier features in this regime.