Edith Cohen

h-index38

25papers

338citations

Novelty58%

AI Score54

Ranked #30,140 of 201,326 authors (top 15%)#6,962 in LG (top 16%)

25 Papers

LGNov 11, 2022

Õptimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization

Edith Cohen, Xin Lyu, Jelani Nelson et al.

The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(ξ^{-1} \log(1/β))$ (for generalization error $ξ$ with confidence $1-β$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\tilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al (that applies even with improper learning) and improves over a prior upper bound of $\tilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tildeΘ(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis which we believe will have further applications.

DSJul 3, 2022

Tricking the Hashing Trick: A Tight Lower Bound on the Robustness of CountSketch to Adaptive Inputs

Edith Cohen, Jelani Nelson, Tamás Sarlós et al.

CountSketch and Feature Hashing (the "hashing trick") are popular randomized dimensionality reduction methods that support recovery of $\ell_2$-heavy hitters (keys $i$ where $v_i^2 > ε\|\boldsymbol{v}\|_2^2$) and approximate inner products. When the inputs are {\em not adaptive} (do not depend on prior outputs), classic estimators applied to a sketch of size $O(\ell/ε)$ are accurate for a number of queries that is exponential in $\ell$. When inputs are adaptive, however, an adversarial input can be constructed after $O(\ell)$ queries with the classic estimator and the best known robust estimator only supports $\tilde{O}(\ell^2)$ queries. In this work we show that this quadratic dependence is in a sense inherent: We design an attack that after $O(\ell^2)$ queries produces an adversarial input vector whose sketch is highly biased. Our attack uses "natural" non-adaptive inputs (only the final adversarial input is chosen adaptively) and universally applies with any correct estimator, including one that is unknown to the attacker. In that, we expose inherent vulnerability of this fundamental method.

DSMay 13

Stochastic Matching via Local Sparsification

Sara Ahmadian, Edith Cohen, Mohammad Roghani

The classic online stochastic matching problem typically requires immediate and irrevocable matching decisions. However, in many modern decentralized systems such as real-time ride-hailing and distributed cloud computing, the primary bottleneck is often local communication bandwidth rather than the timing of the match itself. We formalize this challenge by introducing a two-stage local sparsification framework. In this setting, arriving requests must prune their realized compatibility sets to a strict budget of $k$ edges before a central coordinator optimizes the global matching. This creates a "middle ground" between local information constraints and global optimization utility. We propose a local selection strategy, parametrized by a fractional solution of the expected instance. Theoretically, we quantify the approximation ratio as a function of the solution's {\em spread}. We prove that under sufficient spread, our sparsifier globally preserves the expected size of the maximum matching. Empirically, we demonstrate the robustness of our approach using the New York City ride-hailing datasets and adversarial synthetic benchmarks. Our results show that near-optimal global matching is achievable even with highly constrained local budgets, significantly outperforming standard online baselines.

CRFeb 28, 2024

Lower Bounds for Differential Privacy Under Continual Observation and Online Threshold Queries

Edith Cohen, Xin Lyu, Jelani Nelson et al.

One of the most basic problems for studying the "price of privacy over time" is the so called private counter problem, introduced by Dwork et al. (2010) and Chan et al. (2010). In this problem, we aim to track the number of events that occur over time, while hiding the existence of every single event. More specifically, in every time step $t\in[T]$ we learn (in an online fashion) that $Δ_t\geq 0$ new events have occurred, and must respond with an estimate $n_t\approx\sum_{j=1}^t Δ_j$. The privacy requirement is that all of the outputs together, across all time steps, satisfy event level differential privacy. The main question here is how our error needs to depend on the total number of time steps $T$ and the total number of events $n$. Dwork et al. (2015) showed an upper bound of $O\left(\log(T)+\log^2(n)\right)$, and Henzinger et al. (2023) showed a lower bound of $Ω\left(\min\{\log n, \log T\}\right)$. We show a new lower bound of $Ω\left(\min\{n,\log T\}\right)$, which is tight w.r.t. the dependence on $T$, and is tight in the sparse case where $\log^2 n=O(\log T)$. Our lower bound has the following implications: $\bullet$ We show that our lower bound extends to the "online thresholds problem", where the goal is to privately answer many "quantile queries" when these queries are presented one-by-one. This resolves an open question of Bun et al. (2017). $\bullet$ Our lower bound implies, for the first time, a separation between the number of mistakes obtainable by a private online learner and a non-private online learner. This partially resolves a COLT'22 open question published by Sanyal and Ramponi. $\bullet$ Our lower bound also yields the first separation between the standard model of private online learning and a recently proposed relaxed variant of it, called private online prediction.

CLFeb 3, 2025

Scaling Embedding Layers in Language Models

Da Yu, Edith Cohen, Badih Ghazi et al.

We propose $SCONE$ ($S$calable, $C$ontextualized, $O$ffloaded, $N$-gram $E$mbedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, $SCONE$ retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. $SCONE$ enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.

LGDec 4, 2023

Hot PATE: Private Aggregation of Distributions for Diverse Task

Edith Cohen, Benjamin Cohen-Wang, Xin Lyu et al.

The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as text generation, where the desired output is a sample from a distribution, face a core tension: as diversity increases, samples from different teachers are less likely to agree, but lower agreement results in reduced utility for the same privacy requirements. Yet suppressing diversity to artificially increase agreement is undesirable, as it distorts the output of the underlying model, and thus reduces output quality. We propose Hot PATE, a variant of PATE designed for diverse generative settings. We formalize the notion of a diversity-preserving ensemble sampler and introduce an efficient sampler that provably transfers diversity without incurring additional privacy cost. Hot PATE requires only API access to proprietary models and can be used as a drop-in replacement for existing Cold PATE samplers. Our empirical evaluations corroborate and quantify the benefits, showing significant improvements in the privacy utility trade-off on evaluated in-context learning tasks, both in preserving diversity and in returning relevant responses.

LGJul 22, 2025

The Cost of Compression: Tight Quadratic Black-Box Attacks on Sketches for $\ell_2$ Norm Estimation

Sara Ahmadian, Edith Cohen, Uri Stemmer

Dimensionality reduction via linear sketching is a powerful and widely used technique, but it is known to be vulnerable to adversarial inputs. We study the black-box adversarial setting, where a fixed, hidden sketching matrix $A \in R^{k \times n}$ maps high-dimensional vectors $v \in R^n$ to lower-dimensional sketches $A v \in R^k$, and an adversary can query the system to obtain approximate $\ell_2$-norm estimates that are computed from the sketch. We present a universal, nonadaptive attack that, using $\tilde{O}(k^2)$ queries, either causes a failure in norm estimation or constructs an adversarial input on which the optimal estimator for the query distribution (used by the attack) fails. The attack is completely agnostic to the sketching matrix and to the estimator: it applies to any linear sketch and any query responder, including those that are randomized, adaptive, or tailored to the query distribution. Our lower bound construction tightly matches the known upper bounds of $\tildeΩ(k^2)$, achieved by specialized estimators for Johnson Lindenstrauss transforms and AMS sketches. Beyond sketching, our results uncover structural parallels to adversarial attacks in image classification, highlighting fundamental vulnerabilities of compressed representations.

DSJul 18, 2025

Tight Bounds for Answering Adaptively Chosen Concentrated Queries

Emma Rapoport, Edith Cohen, Uri Stemmer

Most work on adaptive data analysis assumes that samples in the dataset are independent. When correlations are allowed, even the non-adaptive setting can become intractable, unless some structural constraints are imposed. To address this, Bassily and Freund [2016] introduced the elegant framework of concentrated queries, which requires the analyst to restrict itself to queries that are concentrated around their expected value. While this assumption makes the problem trivial in the non-adaptive setting, in the adaptive setting it remains quite challenging. In fact, all known algorithms in this framework support significantly fewer queries than in the independent case: At most $O(n)$ queries for a sample of size $n$, compared to $O(n^2)$ in the independent setting. In this work, we prove that this utility gap is inherent under the current formulation of the concentrated queries framework, assuming some natural conditions on the algorithm. Additionally, we present a simplified version of the best-known algorithms that match our impossibility result.

LGJun 5, 2025

Urania: Differentially Private Insights into AI Use

Daogao Liu, Edith Cohen, Badih Ghazi et al.

We introduce $Urania$, a novel framework for generating insights about LLM chatbot interactions with rigorous differential privacy (DP) guarantees. The framework employs a private clustering mechanism and innovative keyword extraction methods, including frequency-based, TF-IDF-based, and LLM-guided approaches. By leveraging DP tools such as clustering, partition selection, and histogram-based summarization, $Urania$ provides end-to-end privacy protection. Our evaluation assesses lexical and semantic content preservation, pair similarity, and LLM-based metrics, benchmarking against a non-private Clio-inspired pipeline (Tamkin et al., 2024). Moreover, we develop a simple empirical privacy evaluation that demonstrates the enhanced robustness of our DP pipeline. The results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy, effectively balancing data utility with privacy preservation.

DSFeb 28, 2022

On the Robustness of CountSketch to Adaptive Inputs

Edith Cohen, Xin Lyu, Jelani Nelson et al.

CountSketch is a popular dimensionality reduction technique that maps vectors to a lower dimension using randomized linear measurements. The sketch supports recovering $\ell_2$-heavy hitters of a vector (entries with $v[i]^2 \geq \frac{1}{k}\|\boldsymbol{v}\|^2_2$). We study the robustness of the sketch in adaptive settings where input vectors may depend on the output from prior inputs. Adaptive settings arise in processes with feedback or with adversarial attacks. We show that the classic estimator is not robust, and can be attacked with a number of queries of the order of the sketch size. We propose a robust estimator (for a slightly modified sketch) that allows for quadratic number of queries in the sketch size, which is an improvement factor of $\sqrt{k}$ (for $k$ heavy hitters) over prior work.

LGDec 29, 2021

Differentially-Private Clustering of Easy Instances

Edith Cohen, Haim Kaplan, Yishay Mansour et al.

Clustering is a fundamental problem in data analysis. In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. Despite significant research progress, the problem had so far resisted practical solutions. In this work we aim at providing simple implementable differentially private clustering algorithms that provide utility when the data is "easy," e.g., when there exists a significant separation between the clusters. We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results. We are able to get improved sample complexity bounds in some cases of Gaussian mixtures and $k$-means. We complement our theoretical analysis with an empirical evaluation on synthetic data.

LGOct 19, 2021

FriendlyCore: Practical Differentially Private Aggregation

Eliad Tsfadia, Edith Cohen, Haim Kaplan et al.

Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool, $\mathsf{FriendlyCore}$, that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset ${\cal C} \subseteq {\cal D}$ that includes all points, except possibly few outliers, and is {\em certified} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.

DSJul 30, 2021

A Framework for Adversarial Streaming via Differential Privacy and Difference Estimators

Idan Attias, Edith Cohen, Moshe Shechner et al.

Classical streaming algorithms operate under the (not always reasonable) assumption that the input stream is fixed in advance. Recently, there is a growing interest in designing robust streaming algorithms that provide provable guarantees even when the input stream is chosen adaptively as the execution progresses. We propose a new framework for robust streaming that combines techniques from two recently suggested frameworks by Hassidim et al. [NeurIPS 2020] and by Woodruff and Zhou [FOCS 2021]. These recently suggested frameworks rely on very different ideas, each with its own strengths and weaknesses. We combine these two frameworks into a single hybrid framework that obtains the ``best of both worlds'', thereby solving a question left open by Woodruff and Zhou.

LGOct 25, 2020

Differentially Private Weighted Sampling

Edith Cohen, Ofir Geri, Tamas Sarlos et al.

Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS): A method that ensures element-level differential privacy while retaining, to the extent possible, the utility of a respective non-private weighted sample. PWS maximizes the reporting probabilities of keys and estimation quality of a broad family of statistics. PWS improves over the state of the art also for the well-studied special case of private histograms, when no sampling is performed. We empirically demonstrate significant performance gains compared with prior baselines: 20%-300% increase in key reporting for common Zipfian frequency distributions and accuracy for $\times 2$-$ 8$ lower frequencies in estimation tasks. Moreover, PWS is applied as a simple post-processing of a non-private sample, without requiring the original data. This allows for seamless integration with existing implementations of non-private schemes and retaining the efficiency of schemes designed for resource-constrained settings such as massive distributed or streamed data. We believe that due to practicality and performance, PWS may become a method of choice in applications where privacy is desired.

LGJul 14, 2020

WOR and $p$'s: Sketches for $\ell_p$-Sampling Without Replacement

Edith Cohen, Rasmus Pagh, David P. Woodruff

Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the case in practice, without-replacement (WOR) sampling is much more effective than with-replacement (WR) sampling: it provides a broader representation and higher accuracy for the same number of samples. We design novel composable sketches for WOR $\ell_p$ sampling, weighted sampling of keys according to a power $p\in[0,2]$ of their frequency (or for signed data, sum of updates). Our sketches have size that grows only linearly with the sample size. Our design is simple and practical, despite intricate analysis, and based on off-the-shelf use of widely implemented heavy hitters sketches such as CountSketch. Our method is the first to provide WOR sampling in the important regime of $p>1$ and the first to handle signed updates for $p>0$.

LGMay 31, 2020

Graph Learning with Loss-Guided Training

Eliav Buchnik, Edith Cohen

Classically, ML models trained with stochastic gradient descent (SGD) are designed to minimize the average loss per example and use a distribution of training examples that remains {\em static} in the course of training. Research in recent years demonstrated, empirically and theoretically, that significant acceleration is possible by methods that dynamically adjust the training distribution in the course of training so that training is more focused on examples with higher loss. We explore {\em loss-guided training} in a new domain of node embedding methods pioneered by {\sc DeepWalk}. These methods work with implicit and large set of positive training examples that are generated using random walks on the input graph and therefore are not amenable for typical example selection methods. We propose computationally efficient methods that allow for loss-guided training in this framework. Our empirical evaluation on a rich collection of datasets shows significant acceleration over the baseline static methods, both in terms of total training performed and overall computation.

LGJul 31, 2019

Sample Complexity Bounds for Influence Maximization

Gal Sadeh, Edith Cohen, Haim Kaplan

Influence maximization (IM) is the problem of finding for a given $s\geq 1$ a set $S$ of $|S|=s$ nodes in a network with maximum influence. With stochastic diffusion models, the influence of a set $S$ of seed nodes is defined as the expectation of its reachability over simulations, where each simulation specifies a deterministic reachability function. Two well-studied special cases are the Independent Cascade (IC) and the Linear Threshold (LT) models of Kempe, Kleinberg, and Tardos. The influence function in stochastic diffusion is unbiasedly estimated by averaging reachability values over i.i.d. simulations. We study the IM sample complexity: the number of simulations needed to determine a $(1-ε)$-approximate maximizer with confidence $1-δ$. Our main result is a surprising upper bound of $O( s τε^{-2} \ln \frac{n}δ)$ for a broad class of models that includes IC and LT models and their mixtures, where $n$ is the number of nodes and $τ$ is the number of diffusion steps. Generally $τ\ll n$, so this significantly improves over the generic upper bound of $O(s n ε^{-2} \ln \frac{n}δ)$. Our sample complexity bounds are derived from novel upper bounds on the variance of the reachability that allow for small relative error for influential sets and additive error when influence is small. Moreover, we provide a data-adaptive method that can detect and utilize fewer simulations on models where it suffices. Finally, we provide an efficient greedy design that computes an $(1-1/e-ε)$-approximate maximizer from simulations and applies to any submodular stochastic diffusion model that satisfies the variance bounds.

LGMar 14, 2018

Eliav Buchnik, Edith Cohen, Avinatan Hassidim et al.

Optimization of machine learning models is commonly performed through stochastic gradient updates on randomly ordered training examples. This practice means that sub-epochs comprise of independent random samples of the training data that may not preserve informative structure present in the full data. We hypothesize that the training can be more effective with {\em self-similar} arrangements that potentially allow each epoch to provide benefits of multiple ones. We study this for "matrix factorization" -- the common task of learning metric embeddings of entities such as queries, videos, or words from example pairwise associations. We construct arrangements that preserve the weighted Jaccard similarities of rows and columns and experimentally observe training acceleration of 3\%-37\% on synthetic and recommendation datasets. Principled arrangements of training examples emerge as a novel and potentially powerful enhancement to SGD that merits further exploration.

LGJun 12, 2017

Clustering Small Samples with Quality Guarantees: Adaptivity with One2all pps

Edith Cohen, Shiri Chechik, Haim Kaplan

Clustering of data points is a fundamental tool in data analysis. We consider points $X$ in a relaxed metric space, where the triangle inequality holds within a constant factor. The {\em cost} of clustering $X$ by $Q$ is $V(Q)=\sum_{x\in X} d_{xQ}$. Two basic tasks, parametrized by $k \geq 1$, are {\em cost estimation}, which returns (approximate) $V(Q)$ for queries $Q$ such that $|Q|=k$ and {\em clustering}, which returns an (approximate) minimizer of $V(Q)$ of size $|Q|=k$. With very large data sets $X$, we seek efficient constructions of small samples that act as surrogates to the full data for performing these tasks. Existing constructions that provide quality guarantees are either worst-case, and unable to benefit from structure of real data sets, or make explicit strong assumptions on the structure. We show here how to avoid both these pitfalls using adaptive designs. At the core of our design is the {\em one2all} construction of multi-objective probability-proportional-to-size (pps) samples: Given a set $M$ of centroids and $α\geq 1$, one2all efficiently assigns probabilities to points so that the clustering cost of {\em each} $Q$ with cost $V(Q) \geq V(M)/α$ can be estimated well from a sample of size $O(α|M|ε^{-2})$. For cost queries, we can obtain worst-case sample size $O(kε^{-2})$ by applying one2all to a bicriteria approximation $M$, but we adaptively balance $|M|$ and $α$ to further reduce sample size. For clustering, we design an adaptive wrapper that applies a base clustering algorithm to a sample $S$. Our wrapper uses the smallest sample that provides statistical guarantees that the quality of the clustering on the sample carries over to the full data set. We demonstrate experimentally the huge gains of using our adaptive instead of worst-case methods.

LGMar 7, 2017

Bootstrapped Graph Diffusions: Exposing the Power of Nonlinearity

Eliav Buchnik, Edith Cohen

Graph-based semi-supervised learning (SSL) algorithms predict labels for all nodes based on provided labels of a small set of seed nodes. Classic methods capture the graph structure through some underlying diffusion process that propagates through the graph edges. Spectral diffusion, which includes personalized page rank and label propagation, propagates through random walks. Social diffusion propagates through shortest paths. A common ground to these diffusions is their {\em linearity}, which does not distinguish between contributions of few "strong" relations and many "weak" relations. Recently, non-linear methods such as node embeddings and graph convolutional networks (GCN) demonstrated a large gain in quality for SSL tasks. These methods introduce multiple components and greatly vary on how the graph structure, seed label information, and other features are used. We aim here to study the contribution of non-linearity, as an isolated ingredient, to the performance gain. To do so, we place classic linear graph diffusions in a self-training framework. Surprisingly, we observe that SSL using the resulting {\em bootstrapped diffusions} not only significantly improves over the respective non-bootstrapped baselines but also outperform state-of-the-art non-linear SSL methods. Moreover, since the self-training wrapper retains the scalability of the base method, we obtain both higher quality and better scalability.

CRDec 8, 2016

Efficient Distinct Heavy Hitters for DNS DDoS Attack Detection

Yehuda Afek, Anat Bremler-Barr, Edith Cohen et al.

Motivated by a recent new type of randomized Distributed Denial of Service (DDoS) attacks on the Domain Name Service (DNS), we develop novel and efficient distinct heavy hitters algorithms and build an attack identification system that uses our algorithms. Heavy hitter detection in streams is a fundamental problem with many applications, including detecting certain DDoS attacks and anomalies. A (classic) heavy hitter (HH) in a stream of elements is a key (e.g., the domain of a query) which appears in many elements (e.g., requests). When stream elements consist of a <key; subkey> pairs, (<domain; subdomain>) a distinct heavy hitter (dhh) is a key that is paired with a large number of different subkeys. Our dHH algorithms are considerably more practical than previous algorithms. Specifically the new fixed-size algorithms are simple to code and with asymptotically optimal space accuracy tradeoffs. In addition we introduce a new measure, a combined heavy hitter (cHH), which is a key with a large combination of distinct and classic weights. Efficient algorithms are also presented for cHH detection. Finally, we perform extensive experimental evaluation on real DNS attack traces, demonstrating the effectiveness of both our algorithms and our DNS malicious queries identification system.

DSJul 21, 2016

HyperLogLog Hyper Extended: Sketches for Concave Sublinear Frequency Statistics

Edith Cohen

One of the most common statistics computed over data elements is the number of distinct keys. A thread of research pioneered by Flajolet and Martin three decades ago culminated in the design of optimal approximate counting sketches, which have size that is double logarithmic in the number of distinct keys and provide estimates with a small relative error. Moreover, the sketches are composable, and thus suitable for streamed, parallel, or distributed computation. We consider here all statistics of the frequency distribution of keys, where a contribution of a key to the aggregate is concave and grows (sub)linearly with its frequency. These fundamental aggregations are very common in text, graphs, and logs analysis and include logarithms, low frequency moments, and capping statistics. We design composable sketches of double-logarithmic size for all concave sublinear statistics. Our design combines theoretical optimality and practical simplicity. In a nutshell, we specify tailored mapping functions of data elements to output elements so that our target statistics on the data elements is approximated by the (max-) distinct statistics of the output elements, which can be approximated using off-the-shelf sketches. Our key insight is relating these target statistics to the {\em complement Laplace} transform of the input frequencies.

LGMar 30, 2016

Semi-Supervised Learning on Graphs through Reach and Distance Diffusion

Edith Cohen

Semi-supervised learning (SSL) is an indispensable tool when there are few labeled entities and many unlabeled entities for which we want to predict labels. With graph-based methods, entities correspond to nodes in a graph and edges represent strong relations. At the heart of SSL algorithms is the specification of a dense {\em kernel} of pairwise affinity values from the graph structure. A learning algorithm is then trained on the kernel together with labeled entities. The most popular kernels are {\em spectral} and include the highly scalable "symmetric" Laplacian methods, that compute a soft labels using Jacobi iterations, and "asymmetric" methods including Personalized Page Rank (PPR) which use short random walks and apply with directed relations, such as like, follow, or hyperlinks. We introduce {\em Reach diffusion} and {\em Distance diffusion} kernels that build on powerful social and economic models of centrality and influence in networks and capture the directed pairwise relations that underline social influence. Inspired by the success of social influence as an alternative to spectral centrality such as Page Rank, we explore SSL with our kernels and develop highly scalable algorithms for parameter setting, label learning, and sampling. We perform preliminary experiments that demonstrate the properties and potential of our kernels.

SIMar 30, 2015

Average Distance Queries through Weighted Samples in Graphs and Metric Spaces: High Scalability with Tight Statistical Guarantees

Shiri Chechik, Edith Cohen, Haim Kaplan

The average distance from a node to all other nodes in a graph, or from a query point in a metric space to a set of points, is a fundamental quantity in data analysis. The inverse of the average distance, known as the (classic) closeness centrality of a node, is a popular importance measure in the study of social networks. We develop novel structural insights on the sparsifiability of the distance relation via weighted sampling. Based on that, we present highly practical algorithms with strong statistical guarantees for fundamental problems. We show that the average distance (and hence the centrality) for all nodes in a graph can be estimated using $O(ε^{-2})$ single-source distance computations. For a set $V$ of $n$ points in a metric space, we show that after preprocessing which uses $O(n)$ distance computations we can compute a weighted sample $S\subset V$ of size $O(ε^{-2})$ such that the average distance from any query point $v$ to $V$ can be estimated from the distances from $v$ to $S$. Finally, we show that for a set of points $V$ in a metric space, we can estimate the average pairwise distance using $O(n+ε^{-2})$ distance computations. The estimate is based on a weighted sample of $O(ε^{-2})$ pairs of points, which is computed using $O(n)$ distance computations. Our estimates are unbiased with normalized mean square error (NRMSE) of at most $ε$. Increasing the sample size by a $O(\log n)$ factor ensures that the probability that the relative error exceeds $ε$ is polynomially small.

IRFeb 20, 2015

Stream Sampling for Frequency Cap Statistics

Edith Cohen

Unaggregated data, in streamed or distributed form, is prevalent and come from diverse application domains which include interactions of users with web services and IP traffic. Data elements have {\em keys} (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are {\em distinct}, which is the number of active keys in a specified segment, and {\em sum}, which is the sum of the frequencies of keys in the segment. Both are special cases of {\em cap} statistics, defined as the sum of frequencies {\em capped} by a parameter $T$, which are popular in online advertising platforms. Aggregation by key, however, is costly, requiring state proportional to the number of distinct keys, and therefore we are interested in estimating these statistics or more generally, sampling the data, without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design provides the first effective solution for general frequency cap statistics. Our $\ell$-capped samples provide estimates with tight statistical guarantees for cap statistics with $T=Θ(\ell)$ and nonnegative unbiased estimates of {\em any} monotone non-decreasing frequency statistics. An added benefit of our unified design is facilitating {\em multi-objective samples}, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample.