Jiaming Xu

h-index: 26

45 papers

2,476 citations

Novelty: 59%

AI Score: 46

Ranked #36,762 of 194,257 authors (top 19%); #462 in ML (top 14%)

45 Papers

13.0 · LG · Nov 28, 2023

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Jinhao Li, Jiaming Xu, Shiyao Li et al.

Large language models (LLMs) have demonstrated impressive abilities in various domains, but their inference cost is high. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory consumption. Applying 2-bit single-precision weight quantization brings >3% accuracy loss, so the state-of-the-art methods use mixed-precision methods for LLMs (e.g. Llama2-7b, etc.) to improve the accuracy. However, challenges still exist: (1) Uneven distribution in weight matrix. (2) Large speed degradation by adding sparse outliers. (3) Time-consuming dequantization operations on GPUs. To tackle these challenges and enable fast and efficient LLM inference on GPUs, we propose the following techniques in this paper. (1) Intra-weight mixed-precision quantization. (2) Exclusive 2-bit sparse outlier with minimum speed degradation. (3) Asynchronous dequantization. We conduct extensive experiments on different model families (e.g. Llama3, etc.) and model sizes. We achieve an average of 2.91 bits per weight, counting all scales/zeros, across different models with negligible loss. As a result, with our 2/4/16 mixed-precision quantization for each weight matrix and asynchronous dequantization during inference, our design achieves an end-to-end speedup of 1.74x for Llama2-7b over the original model, and we reduce runtime cost and total cost by up to 2.53x and 2.29x, respectively, with fewer GPU requirements.
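To make the "scales/zeros" bookkeeping concrete, here is a minimal sketch of plain group-wise uniform quantization and dequantization. It is not the paper's mixed 2/4/16-bit scheme or its GPU kernels; the function names `quantize_group` and `dequantize_group` and the min/max calibration rule are assumptions made for illustration only.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric uniform quantization of one weight group (illustrative only).

    Returns integer codes in [0, 2**bits - 1] plus the per-group scale and
    zero (offset) that are needed to reconstruct approximate weights.
    """
    levels = (1 << bits) - 1            # 3 for 2-bit, 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, zero: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [zero + c * scale for c in codes]


# Hypothetical toy group of four weights, quantized to 2 bits.
group = [-0.3, -0.1, 0.1, 0.3]
codes, scale, zero = quantize_group(group, bits=2)
restored = dequantize_group(codes, scale, zero)
print(codes, restored)
```

Each group stores its own scale and zero in addition to the low-bit codes, which is why the average storage per weight ends up above the nominal bit width.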

9.6 · LG · Jun 15, 2022

Global Convergence of Federated Learning for Mixed Regression

Lili Su, Jiaming Xu, Pengkun Yang

This paper studies the problem of model training under Federated Learning when clients exhibit cluster structure. We contextualize this problem in mixed regression, where each client has limited local data generated from one of $k$ unknown regression models. We design an algorithm that achieves global convergence from any initialization, and works even when local data volume is highly unbalanced -- there could exist clients that contain $O(1)$ data points only. Our algorithm first runs moment descent on a few anchor clients (each with $\tilde{Ω}(k)$ data points) to obtain coarse model estimates. Then each client alternately estimates its cluster labels and refines the model estimates based on FedAvg or FedProx. A key innovation in our analysis is a uniform estimate on the clustering errors, which we prove by bounding the VC dimension of general polynomial concept classes based on the theory of algebraic geometry.

11.3 · AR · Sep 16, 2024

MARCA: Mamba Accelerator with ReConfigurable Architecture

Jinhao Li, Shan Huang, Jiaming Xu et al.

We propose a Mamba accelerator with reconfigurable architecture, MARCA. We propose three novel approaches in this paper. (1) Reduction alternative PE array architecture for both linear and element-wise operations. For linear operations, the reduction tree connected to PE arrays is enabled and executes the reduction operation. For element-wise operations, the reduction tree is disabled and the output bypasses. (2) Reusable nonlinear function unit based on the reconfigurable PE. We decompose the exponential function into element-wise operations and a shift operation by a fast biased exponential algorithm, and the activation function (SiLU) into a range detection and element-wise operations by a piecewise approximation algorithm. Thus, the reconfigurable PEs are reused to execute nonlinear functions with negligible accuracy loss. (3) Intra-operation and inter-operation buffer management strategy. We propose an intra-operation buffer management strategy to maximize input data sharing for linear operations within operations, and an inter-operation strategy for element-wise operations between operations. We conduct extensive experiments on Mamba model families with different sizes. MARCA achieves up to 463.22$\times$/11.66$\times$ speedup and up to 9761.42$\times$/242.52$\times$ energy efficiency compared to Intel Xeon 8358P CPU and NVIDIA Tesla A100 GPU implementations, respectively.

9.6 · AI · Nov 30, 2025

SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs

Jiaming Xu, Jiayi Pan, Hanzhen Wang et al.

In this paper, we point out that the objective of the retrieval algorithms is to align with the LLM, which is similar to the objective of knowledge distillation in LLMs. We analyze the similarity in information focus between the distilled language model (DLM) and the original LLM from the perspective of information theory, and thus propose a novel paradigm that leverages a DLM as the retrieval algorithm. Based on the insight, we present SpeContext, an algorithm and system co-design for long-context reasoning. (1) At the algorithm level, SpeContext proposes a lightweight retrieval head based on the head-level attention weights of the DLM, achieving >90% parameter reduction by pruning the redundancy. (2) At the system level, SpeContext designs an asynchronous prefetch dataflow via the elastic loading strategy, effectively overlapping KV cache retrieval with the LLM computation. (3) At the compilation level, SpeContext constructs the theoretical memory model and implements an adaptive memory management system to achieve acceleration by maximizing GPU memory utilization. We deploy and evaluate SpeContext in two resource-constrained environments, cloud and edge. Extensive experiments show that, compared with the Huggingface framework, SpeContext achieves up to 24.89x throughput improvement in cloud and 10.06x speedup in edge with negligible accuracy loss, pushing the Pareto frontier of accuracy and throughput.

14.5 · RO · Mar 30, 2025

Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models

Haonan Chen, Jiaming Xu, Lily Sheng et al.

When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the predictive nature of human manipulation strategies into robot imitation learning. Specifically, we disentangle task-related state transitions from agent-specific inverse dynamics modeling to enable effective bimanual coordination. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, envisioning how the scene evolves. Then, we use an inverse dynamics model to compute robot actions that achieve the predicted states. Our key insight is that modeling object movement can help learn policies for coordinated bimanual manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object setups, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, maintain stability across different control modes, and synthesize a broader range of behaviors than those present in the demonstration dataset.

8.4 · CV · Sep 17, 2025

SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation

Jiayi Pan, Jiaming Xu, Yongkang Zhou et al.

Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process of the diffusion model. In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. We propose a novel paradigm that introduces future information via self-speculation based on the information similarity at the same time step across different iteration times. Based on this paradigm, we present SpecDiff, a training-free multi-level feature caching strategy including a cached feature selection algorithm and a multi-level feature classification algorithm. (1) Feature selection algorithm based on self-speculative information. SpecDiff determines a dynamic importance score for each token based on self-speculative information and historical information, and performs cached feature selection through the importance score. (2) Multi-level feature classification algorithm based on feature importance scores. SpecDiff classifies tokens by leveraging the differences in feature importance scores and introduces a multi-level feature calculation strategy. Extensive experiments show that SpecDiff achieves average speedups of 2.80$\times$, 2.74$\times$, and 3.17$\times$ with negligible quality loss in Stable Diffusion 3, 3.5, and FLUX, respectively, compared to RFlow on NVIDIA A800-80GB GPU. By merging speculative and historical information, SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing the Pareto frontier of speedup and accuracy in the efficient diffusion model inference.

7.7 · LG · May 31, 2023

Federated Learning in the Presence of Adversarial Client Unavailability

Lili Su, Ming Xiang, Jiaming Xu et al.

Federated learning is a decentralized machine learning framework that enables collaborative model training without revealing raw data. Due to the diverse hardware and software limitations, a client may not always be available for the computation requests from the parameter server. An emerging line of research is devoted to tackling arbitrary client unavailability. However, existing work still imposes structural assumptions on the unavailability patterns, impeding their applicability in challenging scenarios wherein the unavailability patterns are beyond the control of the parameter server. Moreover, in harsh environments like battlefields, adversaries can selectively and adaptively silence specific clients. In this paper, we relax the structural assumptions and consider adversarial client unavailability. To quantify the degrees of client unavailability, we use the notion of $ε$-adversary dropout fraction. We show that simple variants of FedAvg or FedProx, albeit completely agnostic to $ε$, converge to an estimation error on the order of $ε(G^2 + σ^2)$ for non-convex global objectives and $ε(G^2 + σ^2)/μ^2$ for $μ$ strongly convex global objectives, where $G$ is a heterogeneity parameter and $σ^2$ is the noise level. Conversely, we prove that any algorithm has to suffer an estimation error of at least $ε(G^2 + σ^2)/8$ and $ε(G^2 + σ^2)/(8μ^2)$ for non-convex global objectives and $μ$-strongly convex global objectives. Furthermore, the convergence speeds of the FedAvg or FedProx variants are $O(1/\sqrt{T})$ for non-convex objectives and $O(1/T)$ for strongly-convex objectives, both of which are the best possible for any first-order method that only has access to noisy gradients.

10.3 · ST · Feb 22, 2022

Random Graph Matching in Geometric Models: the Case of Complete Graphs

Haoyu Wang, Yihong Wu, Jiaming Xu et al.

This paper studies the problem of matching two complete graphs with edge weights correlated through latent geometries, extending a recent line of research on random graph matching with independent edge weights to geometric models. Specifically, given a random permutation $π^*$ on $[n]$ and $n$ iid pairs of correlated Gaussian vectors $\{X_{π^*(i)}, Y_i\}$ in $\mathbb{R}^d$ with noise parameter $σ$, the edge weights are given by $A_{ij}=κ(X_i,X_j)$ and $B_{ij}=κ(Y_i,Y_j)$ for some link function $κ$. The goal is to recover the hidden vertex correspondence $π^*$ based on the observation of $A$ and $B$. We focus on the dot-product model with $κ(x,y)=\langle x, y \rangle$ and Euclidean distance model with $κ(x,y)=\|x-y\|^2$, in the low-dimensional regime of $d=o(\log n)$ wherein the underlying geometric structures are most evident. We derive an approximate maximum likelihood estimator, which provably achieves, with high probability, perfect recovery of $π^*$ when $σ=o(n^{-2/d})$ and almost perfect recovery with a vanishing fraction of errors when $σ=o(n^{-1/d})$. Furthermore, these conditions are shown to be information-theoretically optimal even when the latent coordinates $\{X_i\}$ and $\{Y_i\}$ are observed, complementing the recent results of [DCK19] and [KNW22] in geometric models of the planted bipartite matching problem. As a side discovery, we show that the celebrated spectral algorithm of [Ume88] emerges as a further approximation to the maximum likelihood in the geometric model.

11.7 · ST · Oct 22, 2021

Testing network correlation efficiently via counting trees

Cheng Mao, Yihong Wu, Jiaming Xu et al.

We propose a new procedure for testing whether two networks are edge-correlated through some latent vertex correspondence. The test statistic is based on counting the co-occurrences of signed trees for a family of non-isomorphic trees. When the two networks are Erdős-Rényi random graphs $\mathcal{G}(n,q)$ that are either independent or correlated with correlation coefficient $ρ$, our test runs in $n^{2+o(1)}$ time and succeeds with high probability as $n\to\infty$, provided that $n\min\{q,1-q\} \ge n^{-o(1)}$ and $ρ^2>α\approx 0.338$, where $α$ is Otter's constant so that the number of unlabeled trees with $K$ edges grows as $(1/α)^K$. This significantly improves the prior work in terms of statistical accuracy, running time, and graph sparsity.

13.8 · ML · Jun 29, 2021

A Non-parametric View of FedAvg and FedProx: Beyond Stationary Points

Lili Su, Jiaming Xu, Pengkun Yang

Federated Learning (FL) is a promising decentralized learning framework and has great potential in privacy preservation and in lowering the computation load at the cloud. Recent work showed that FedAvg and FedProx, the two widely adopted FL algorithms, fail to reach the stationary points of the global optimization objective even for homogeneous linear regression problems. Further, there is concern that the common model learned might not generalize well locally at all in the presence of heterogeneity. In this paper, we analyze the convergence and statistical efficiency of FedAvg and FedProx, addressing the above two concerns. Our analysis is based on the standard non-parametric regression in a reproducing kernel Hilbert space (RKHS), and allows for heterogeneous local data distributions and unbalanced local datasets. We prove that the estimation errors, measured in either the empirical norm or the RKHS norm, decay with a rate of 1/t in general and exponentially for finite-rank kernels. In certain heterogeneous settings, these upper bounds also imply that both FedAvg and FedProx achieve the optimal error rate. To further analytically quantify the impact of the heterogeneity at each client, we propose and characterize a novel notion, federation gain, defined as the reduction of the estimation error for a client to join the FL. We discover that when the data heterogeneity is moderate, a client with limited local data can benefit from a common model with a large federation gain. Numerical experiments further corroborate our theoretical findings.

6.3 · ML · May 1, 2021

One-pass Stochastic Gradient Descent in Overparametrized Two-layer Neural Networks

Jiaming Xu, Hanjing Zhu

There has been a recent surge of interest in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks. Most previous works assume that the training data is provided a priori in a batch, while less attention has been paid to the important setting where the training data arrives in a stream. In this paper, we study the streaming data setup and show that with overparameterization and random initialization, the prediction error of two-layer neural networks under one-pass SGD converges in expectation. The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK). A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmid's inequality.

7.3 · ST · Mar 17, 2021

The planted matching problem: Sharp threshold and infinite-order phase transition

Jian Ding, Yihong Wu, Jiaming Xu et al.

We study the problem of reconstructing a perfect matching $M^*$ hidden in a randomly weighted $n\times n$ bipartite graph. The edge set includes every node pair in $M^*$ and each of the $n(n-1)$ node pairs not in $M^*$ independently with probability $d/n$. The weight of each edge $e$ is independently drawn from the distribution $\mathcal{P}$ if $e \in M^*$ and from $\mathcal{Q}$ if $e \notin M^*$. We show that if $\sqrt{d} B(\mathcal{P},\mathcal{Q}) \le 1$, where $B(\mathcal{P},\mathcal{Q})$ stands for the Bhattacharyya coefficient, the reconstruction error (average fraction of misclassified edges) of the maximum likelihood estimator of $M^*$ converges to $0$ as $n\to \infty$. Conversely, if $\sqrt{d} B(\mathcal{P},\mathcal{Q}) \ge 1+ε$ for an arbitrarily small constant $ε>0$, the reconstruction error for any estimator is shown to be bounded away from $0$ under both the sparse and dense model, resolving the conjecture in [Moharrami et al. 2019, Semerjian et al. 2020]. Furthermore, in the special case of complete exponentially weighted graph with $d=n$, $\mathcal{P}=\exp(λ)$, and $\mathcal{Q}=\exp(1/n)$, for which the sharp threshold simplifies to $λ=4$, we prove that when $λ\le 4-ε$, the optimal reconstruction error is $\exp\left( - Θ(1/\sqrt{ε}) \right)$, confirming the conjectured infinite-order phase transition in [Semerjian et al. 2020].
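As a quick sanity check on the stated threshold, the Bhattacharyya coefficient can be worked out in closed form for the exponential special case. This sketch assumes that $\exp(λ)$ denotes the exponential distribution with rate $λ$ (density $λe^{-λx}$ on $x\ge 0$), a parametrization the abstract does not spell out.

$$
B(\mathcal{P},\mathcal{Q}) = \int_0^\infty \sqrt{λ e^{-λx}\cdot \tfrac{1}{n} e^{-x/n}}\,dx = \sqrt{\tfrac{λ}{n}} \int_0^\infty e^{-(λ+1/n)x/2}\,dx = \frac{2\sqrt{λ/n}}{λ+1/n}
$$

$$
\sqrt{d}\,B(\mathcal{P},\mathcal{Q})\Big|_{d=n} = \frac{2\sqrt{λ}}{λ+1/n} \;\longrightarrow\; \frac{2}{\sqrt{λ}} \quad (n\to\infty)
$$

Under that assumption, the condition $\sqrt{d}\,B(\mathcal{P},\mathcal{Q}) \le 1$ reduces to $λ \ge 4$ in the limit, which is consistent with the threshold $λ=4$ quoted above.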

7.4 · ML · Feb 23, 2021

Learner-Private Convex Optimization

Jiaming Xu, Kuang Xu, Dana Yang

Convex optimization with feedback is a framework where a learner relies on iterative queries and feedback to arrive at the minimizer of a convex function. It has gained considerable popularity thanks to its scalability in large-scale optimization and machine learning. The repeated interactions, however, expose the learner to privacy risks from eavesdropping adversaries that observe the submitted queries. In this paper, we study how to optimally obfuscate the learner's queries in convex optimization with first-order feedback, so that their learned optimal value is provably difficult to estimate for an eavesdropping adversary. We consider two formulations of learner privacy: a Bayesian formulation in which the convex function is drawn randomly, and a minimax formulation in which the function is fixed and the adversary's probability of error is measured with respect to a minimax criterion. Suppose that the learner wishes to ensure the adversary cannot estimate accurately with probability greater than $1/L$ for some $L>0$. Our main results show that the query complexity overhead is additive in $L$ in the minimax formulation, but multiplicative in $L$ in the Bayesian formulation. Compared to existing learner-private sequential learning models with binary feedback, our results apply to the significantly richer family of general convex functions with full-gradient feedback. Our proofs rely on tools from the theory of Dirichlet processes, as well as a novel strategy designed for measuring information leakage under a full-gradient oracle.

5.9 · DS · Feb 23, 2021

The Power of $D$-hops in Matching Power-Law Graphs

Liren Yu, Jiaming Xu, Xiaojun Lin

This paper studies seeded graph matching for power-law graphs. Assume that two edge-correlated graphs are independently edge-sampled from a common parent graph with a power-law degree distribution. A set of correctly matched vertex-pairs is chosen at random and revealed as initial seeds. Our goal is to use the seeds to recover the remaining latent vertex correspondence between the two graphs. Departing from the existing approaches that focus on the use of high-degree seeds in $1$-hop neighborhoods, we develop an efficient algorithm that exploits the low-degree seeds in suitably-defined $D$-hop neighborhoods. Specifically, we first match a set of vertex-pairs with appropriate degrees (which we refer to as the first slice) based on the number of low-degree seeds in their $D$-hop neighborhoods. This significantly reduces the number of initial seeds needed to trigger a cascading process to match the rest of the graphs. Under the Chung-Lu random graph model with $n$ vertices, max degree $Θ(\sqrt{n})$, and the power-law exponent $2<β<3$, we show that as soon as $D> \frac{4-β}{3-β}$, by optimally choosing the first slice, with high probability our algorithm can correctly match a constant fraction of the true pairs without any error, provided with only $Ω((\log n)^{4-β})$ initial seeds. Our result achieves an exponential reduction in the seed size requirement, as the best previously known result requires $n^{1/2+ε}$ seeds (for any small constant $ε>0$). Performance evaluation with synthetic and real data further corroborates the improved performance of our algorithm.

15.2 · ST · Jan 29, 2021

Settling the Sharp Reconstruction Thresholds of Random Graph Matching

Yihong Wu, Jiaming Xu, Sophie H. Yu

This paper studies the problem of recovering the hidden vertex correspondence between two edge-correlated random graphs. We focus on the Gaussian model where the two graphs are complete graphs with correlated Gaussian weights and the Erdős-Rényi model where the two graphs are subsampled from a common parent Erdős-Rényi graph $\mathcal{G}(n,p)$. For dense graphs with $p=n^{-o(1)}$, we prove that there exists a sharp threshold, above which one can correctly match all but a vanishing fraction of vertices and below which correctly matching any positive fraction is impossible, a phenomenon known as the "all-or-nothing" phase transition. Even more strikingly, in the Gaussian setting, above the threshold all vertices can be exactly matched with high probability. In contrast, for sparse Erdős-Rényi graphs with $p=n^{-Θ(1)}$, we show that the all-or-nothing phenomenon no longer holds and we determine the thresholds up to a constant factor. Along the way, we also derive the sharp threshold for exact recovery, sharpening the existing results in Erdős-Rényi graphs. The proof of the negative results builds upon a tight characterization of the mutual information based on the truncated second-moment computation and an "area theorem" that relates the mutual information to the integral of the reconstruction error. The positive results follow from a tight analysis of the maximum likelihood estimator that takes into account the cycle structure of the induced permutation on the edges.

11.7 · ST · Aug 23, 2020

Testing correlation of unlabeled random graphs

Yihong Wu, Jiaming Xu, Sophie H. Yu

We study the problem of detecting the edge correlation between two random graphs with $n$ unlabeled nodes. This is formalized as a hypothesis testing problem, where under the null hypothesis, the two graphs are independently generated; under the alternative, the two graphs are edge-correlated under some latent node correspondence, but have the same marginal distributions as the null. For both Gaussian-weighted complete graphs and dense Erdős-Rényi graphs (with edge probability $n^{-o(1)}$), we determine the sharp threshold at which the optimal testing error probability exhibits a phase transition from zero to one as $n\to \infty$. For sparse Erdős-Rényi graphs with edge probability $n^{-Ω(1)}$, we determine the threshold within a constant factor. The proof of the impossibility results is an application of the conditional second-moment method, where we bound the truncated second moment of the likelihood ratio by carefully conditioning on the typical behavior of the intersection graph (consisting of edges in both observed graphs) and taking into account the cycle structure of the induced random permutation on the edges. Notably, in the sparse regime, this is accomplished by leveraging the pseudoforest structure of subcritical Erdős-Rényi graphs and a careful enumeration of subpseudoforests that can be assembled from short orbits of the edge permutation.

6.6 · DS · Apr 8, 2020 · Code

Graph Matching with Partially-Correct Seeds

Liren Yu, Jiaming Xu, Xiaojun Lin

Graph matching aims to find the latent vertex correspondence between two edge-correlated graphs and has found numerous applications across different fields. In this paper, we study a seeded graph matching problem, which assumes that a set of seeds, i.e., pre-mapped vertex-pairs, is given in advance. While most previous work requires all seeds to be correct, we focus on the setting where the seeds are partially correct. Specifically, consider two correlated graphs whose edges are sampled independently from a parent Erdős-Rényi graph $\mathcal{G}(n,p)$. A mapping between the vertices of the two graphs is provided as seeds, of which an unknown $β$ fraction is correct. We first analyze a simple algorithm that matches vertices based on the number of common seeds in the $1$-hop neighborhoods, and then further propose a new algorithm that uses seeds in the $2$-hop neighborhoods. We establish non-asymptotic performance guarantees of perfect matching for both $1$-hop and $2$-hop algorithms, showing that our new $2$-hop algorithm requires substantially fewer correct seeds than the $1$-hop algorithm when graphs are sparse. Moreover, by combining our new performance guarantees for the $1$-hop and $2$-hop algorithms, we attain the best-known results (in terms of the required fraction of correct seeds) across the entire range of graph sparsity and significantly improve the previous results in \cite{10.14778/2794367.2794371,lubars2018correcting} when $p\ge n^{-5/6}$. For instance, when $p$ is a constant or $p=n^{-3/4}$, we show that only $Ω(\sqrt{n\log n})$ correct seeds suffice for perfect matching, while the previously best-known results demand $Ω(n)$ and $Ω(n^{3/4}\log n)$ correct seeds, respectively. Numerical experiments corroborate our theoretical findings, demonstrating the superiority of our $2$-hop algorithm on a variety of synthetic and real graphs.
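The 1-hop rule described above (score a candidate pair by the number of seeds shared by the two 1-hop neighborhoods) can be sketched in a few lines. The function names, the greedy rounding step, and the toy graph are assumptions made for illustration; the paper's actual algorithm and its handling of incorrect seeds may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected adjacency-set graph from an edge list."""
    g = defaultdict(set)
    for a, b in edges:
        g[a].add(b)
        g[b].add(a)
    return dict(g)


def one_hop_scores(g1: Graph, g2: Graph, seeds: Dict[Hashable, Hashable]):
    """For each unseeded pair (u, v), count seeds (s, seeds[s]) such that
    s is a neighbor of u in g1 and seeds[s] is a neighbor of v in g2."""
    seeded2 = set(seeds.values())
    scores = {}
    for u in g1:
        if u in seeds:
            continue
        for v in g2:
            if v in seeded2:
                continue
            scores[(u, v)] = sum(
                1 for s in g1[u] if s in seeds and seeds[s] in g2[v]
            )
    return scores


def greedy_round(scores):
    """Assumed rounding step: repeatedly accept the best-scoring free pair."""
    used1, used2, match = set(), set(), {}
    for (u, v), w in sorted(scores.items(), key=lambda kv: -kv[1]):
        if w > 0 and u not in used1 and v not in used2:
            match[u] = v
            used1.add(u)
            used2.add(v)
    return match


def seeded_one_hop_match(g1: Graph, g2: Graph, seeds):
    return greedy_round(one_hop_scores(g1, g2, seeds))


# Hypothetical toy instance: g2 is g1 with every label shifted by 10,
# and vertices 0 and 1 are revealed as (correct) seeds.
edges = [(0, 2), (1, 2), (0, 3), (1, 4)]
g1 = build_graph(edges)
g2 = build_graph([(a + 10, b + 10) for a, b in edges])
print(seeded_one_hop_match(g1, g2, {0: 10, 1: 11}))
```

In this toy instance all seeds are correct; with partially correct seeds the same counts become noisy, which is the regime the abstract analyzes.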

4.3 · DS · Nov 18, 2019

Consistent recovery threshold of hidden nearest neighbor graphs

Jian Ding, Yihong Wu, Jiaming Xu et al.

Motivated by applications such as discovering strong ties in social networks and assembling genome subsequences in biology, we study the problem of recovering a hidden $2k$-nearest neighbor (NN) graph in an $n$-vertex complete graph, whose edge weights are independent and distributed according to $P_n$ for edges in the hidden $2k$-NN graph and $Q_n$ otherwise. The special case of Bernoulli distributions corresponds to a variant of the Watts-Strogatz small-world graph. We focus on two types of asymptotic recovery guarantees as $n\to \infty$: (1) exact recovery: all edges are classified correctly with probability tending to one; (2) almost exact recovery: the expected number of misclassified edges is $o(nk)$. We show that the maximum likelihood estimator achieves (1) exact recovery for $2 \le k \le n^{o(1)}$ if $ \liminf \frac{2α_n}{\log n}>1$; (2) almost exact recovery for $ 1 \le k \le o\left( \frac{\log n}{\log \log n} \right)$ if $\liminf \frac{kD(P_n||Q_n)}{\log n}>1$, where $α_n \triangleq -2 \log \int \sqrt{d P_n d Q_n}$ is the Rényi divergence of order $\frac{1}{2}$ and $D(P_n||Q_n)$ is the Kullback-Leibler divergence. Under mild distributional assumptions, these conditions are shown to be information-theoretically necessary for any algorithm to succeed. A key challenge in the analysis is the enumeration of $2k$-NN graphs that differ from the hidden one by a given number of edges.

5.7 · ML · Sep 21, 2019

Optimal query complexity for private sequential learning against eavesdropping

Jiaming Xu, Kuang Xu, Dana Yang

We study the query complexity of a learner-private sequential learning problem, motivated by the privacy and security concerns due to eavesdropping that arise in practical applications such as pricing and Federated Learning. A learner tries to estimate an unknown scalar value, by sequentially querying an external database and receiving binary responses; meanwhile, a third-party adversary observes the learner's queries but not the responses. The learner's goal is to design a querying strategy with the minimum number of queries (optimal query complexity) so that she can accurately estimate the true value, while the eavesdropping adversary even with the complete knowledge of her querying strategy cannot. We develop new querying strategies and analytical techniques and use them to prove tight upper and lower bounds on the optimal query complexity. The bounds almost match across the entire parameter range, substantially improving upon existing results. We thus obtain a complete picture of the optimal query complexity as a function of the estimation accuracy and the desired levels of privacy. We also extend the results to sequential learning models in higher dimensions, and where the binary responses are noisy. Our analysis leverages a crucial insight into the nature of the private learning problem, which suggests that the query trajectory of an optimal learner can be divided into distinct phases that focus on pure learning versus learning and obfuscation, respectively.

12.2 · PR · Jul 20, 2019

Spectral Graph Matching and Regularized Quadratic Relaxations II: Erdős-Rényi Graphs and Universality

Zhou Fan, Cheng Mao, Yihong Wu et al.

We analyze a new spectral graph matching algorithm, GRAph Matching by Pairwise eigen-Alignments (GRAMPA), for recovering the latent vertex correspondence between two unlabeled, edge-correlated weighted graphs. Extending the exact recovery guarantees established in the companion paper for Gaussian weights, in this work, we prove the universality of these guarantees for a general correlated Wigner model. In particular, for two Erdős-Rényi graphs with edge correlation coefficient $1-σ^2$ and average degree at least $\operatorname{polylog}(n)$, we show that GRAMPA exactly recovers the latent vertex correspondence with high probability when $σ\lesssim 1/\operatorname{polylog}(n)$. Moreover, we establish a similar guarantee for a variant of GRAMPA, corresponding to a tighter quadratic programming relaxation of the quadratic assignment problem. Our analysis exploits a resolvent representation of the GRAMPA similarity matrix and local laws for the resolvents of sparse Wigner matrices.

11.3 · ML · Jul 20, 2019

Spectral Graph Matching and Regularized Quadratic Relaxations I: The Gaussian Model

Zhou Fan, Cheng Mao, Yihong Wu et al.

Graph matching aims at finding the vertex correspondence between two unlabeled graphs that maximizes the total edge weight correlation. This amounts to solving a computationally intractable quadratic assignment problem. In this paper we propose a new spectral method, GRAph Matching by Pairwise eigen-Alignments (GRAMPA). Departing from prior spectral approaches that only compare top eigenvectors, or eigenvectors of the same order, GRAMPA first constructs a similarity matrix as a weighted sum of outer products between all pairs of eigenvectors of the two graphs, with weights given by a Cauchy kernel applied to the separation of the corresponding eigenvalues, then outputs a matching by a simple rounding procedure. The similarity matrix can also be interpreted as the solution to a regularized quadratic programming relaxation of the quadratic assignment problem. For the Gaussian Wigner model in which two complete graphs on $n$ vertices have Gaussian edge weights with correlation coefficient $1-σ^2$, we show that GRAMPA exactly recovers the correct vertex correspondence with high probability when $σ= O(\frac{1}{\log n})$. This matches the state of the art of polynomial-time algorithms, and significantly improves over existing spectral methods which require $σ$ to be polynomially small in $n$. The superiority of GRAMPA is also demonstrated on a variety of synthetic and real datasets, in terms of both statistical accuracy and computational efficiency. Universality results, including similar guarantees for dense and sparse Erdős-Rényi graphs, are deferred to the companion paper.
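One way to write the similarity matrix described above is sketched below. The eigen-decomposition notation, the all-ones matrix $\mathbf{J}$, and the bandwidth parameter $η>0$ are notation assumed here for illustration and do not appear in the abstract; the paper itself should be consulted for the exact definition and the choice of $η$. Writing the two weighted adjacency matrices as $A=\sum_i λ_i u_i u_i^\top$ and $B=\sum_j μ_j v_j v_j^\top$:

$$
\hat{X} = \sum_{i,j} \frac{1}{(λ_i-μ_j)^2+η^2}\; u_i u_i^\top \mathbf{J}\, v_j v_j^\top = \sum_{i,j} \frac{(u_i^\top \mathbf{1})(\mathbf{1}^\top v_j)}{(λ_i-μ_j)^2+η^2}\; u_i v_j^\top
$$

The second form makes the "weighted sum of outer products between all pairs of eigenvectors" explicit: each term is the outer product $u_i v_j^\top$, weighted by a Cauchy kernel in the eigenvalue gap $λ_i-μ_j$, so pairs with nearby eigenvalues contribute most. A matching is then read off from $\hat{X}$ by a rounding step, for example by picking a permutation that maximizes the sum of the selected entries.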

27.4IRMay 6, 2019

POG: Personalized Outfit Generation for Fashion Recommendation at Alibaba iFashion

Wen Chen, Pipei Huang, Jiaming Xu et al.

Increasing demand for fashion recommendation raises a lot of challenges for online shopping platforms and fashion communities. In particular, there exist two requirements for fashion outfit recommendation: the Compatibility of the generated fashion outfits, and the Personalization in the recommendation process. In this paper, we demonstrate these two requirements can be satisfied via building a bridge between outfit generation and recommendation. Through large data analysis, we observe that people have similar tastes in individual items and outfits. Therefore, we propose a Personalized Outfit Generation (POG) model, which connects user preferences regarding individual items and outfits with Transformer architecture. Extensive offline and online experiments provide strong quantitative evidence that our method outperforms alternative methods regarding both compatibility and personalization metrics. Furthermore, we deploy POG on a platform named Dida in Alibaba to generate personalized outfits for the users of the online application iFashion. This work represents a first step towards an industrial-scale fashion outfit generation and recommendation solution, which goes beyond generating outfits based on explicit queries, or merely recommending from existing outfit pools. As part of this work, we release a large-scale dataset consisting of 1.01 million outfits with rich context information, and 0.28 billion user click actions from 3.57 million users. To the best of our knowledge, this dataset is the largest, publicly available, fashion related dataset, and the first to provide user behaviors relating to both outfits and fashion items.

17.8MLNov 19, 2018

Efficient random graph matching via degree profiles

Jian Ding, Zongming Ma, Yihong Wu et al.

Random graph matching refers to recovering the underlying vertex correspondence between two random graphs with correlated edges; a prominent example is when the two random graphs are given by Erdős-Rényi graphs $G(n,\frac{d}{n})$. This can be viewed as an average-case and noisy version of the graph isomorphism problem. Under this model, the maximum likelihood estimator is equivalent to solving the intractable quadratic assignment problem. This work develops an $\tilde{O}(n d^2+n^2)$-time algorithm which perfectly recovers the true vertex correspondence with high probability, provided that the average degree is at least $d = Ω(\log^2 n)$ and the two graphs differ by at most $δ= O( \log^{-2}(n) )$ fraction of edges. For dense graphs and sparse graphs, this can be improved to $δ= O( \log^{-2/3}(n) )$ and $δ= O( \log^{-2}(d) )$ respectively, both in polynomial time. The methodology is based on appropriately chosen distance statistics of the degree profiles (empirical distribution of the degrees of neighbors). Before this work, the best known result achieves $δ=O(1)$ and $n^{o(1)} \leq d \leq n^c$ for some constant $c$ with an $n^{O(\log n)}$-time algorithm \cite{barak2018nearly} and $δ=\tilde O((d/n)^4)$ and $d = \tildeΩ(n^{4/5})$ with a polynomial-time algorithm \cite{dai2018performance}.
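To make the notion of a degree profile concrete, the following is a simplified sketch under stated assumptions. It computes, for each vertex, the empirical CDF of its neighbors' degrees and compares vertices across the two graphs with an $L_1$ distance between CDFs (which equals the Wasserstein-1 distance for integer-valued degrees). The paper's actual distance statistic, normalization, and matching rule are not given in the abstract, so this choice of distance and the function names are illustrative only.

```python
import numpy as np


def degree_profile_cdfs(adj):
    """Row i holds the empirical CDF of the degrees of vertex i's neighbors.

    `adj` is a symmetric 0/1 adjacency matrix with zero diagonal.
    Isolated vertices get an all-zero row by convention (an assumption).
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1).astype(int)
    cdfs = np.zeros((n, n))
    for i in range(n):
        nbr_deg = deg[adj[i] > 0]
        if nbr_deg.size:
            hist = np.bincount(nbr_deg, minlength=n)[:n]
            cdfs[i] = np.cumsum(hist) / nbr_deg.size
    return cdfs


def profile_distances(adj_a, adj_b):
    """D[i, k] = L1 distance between the degree-profile CDFs of
    vertex i in graph A and vertex k in graph B."""
    ca = degree_profile_cdfs(adj_a)
    cb = degree_profile_cdfs(adj_b)
    return np.abs(ca[:, None, :] - cb[None, :, :]).sum(axis=2)
```

A matching could then be read off by pairing each vertex with its smallest-distance counterpart; for two isomorphic graphs the distance between truly corresponding vertices is exactly zero.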

16.7STSep 30, 2018

Convex Relaxation Methods for Community Detection

Xiaodong Li, Yudong Chen, Jiaming Xu

This paper surveys recent theoretical advances in convex optimization approaches for community detection. We introduce some important theoretical techniques and results for establishing the consistency of convex community detection under various statistical models. In particular, we discuss the basic techniques based on the primal and dual analysis. We also present results that demonstrate several distinctive advantages of convex community detection, including robustness against outlier nodes, consistency under weak assortativity, and adaptivity to heterogeneous degrees. This survey is not intended to be a complete overview of the vast literature on this fast-growing topic. Instead, we aim to provide a big picture of the remarkable recent development in this area and to make the survey accessible to a broad audience. We hope that this expository article can serve as an introductory guide for readers who are interested in using, designing, and analyzing convex relaxation methods in network analysis.

12.5LGJul 26, 2018

Seeded Graph Matching via Large Neighborhood Statistics

Elchanan Mossel, Jiaming Xu

We study a well-known noisy model of the graph isomorphism problem. In this model, the goal is to perfectly recover the vertex correspondence between two edge-correlated Erdős-Rényi random graphs, with an initial seed set of correctly matched vertex pairs revealed as side information. For seeded problems, our result provides a significant improvement over previously known results. We show that it is possible to achieve the information-theoretic limit of graph sparsity in time polynomial in the number of vertices $n$. Moreover, we show the number of seeds needed for exact recovery in polynomial time can be as low as $n^{3ε}$ in the sparse graph regime (with the average degree smaller than $n^ε$) and $Ω(\log n)$ in the dense graph regime. Our results also shed light on the unseeded problem. In particular, we give sub-exponential time algorithms for sparse models and an $n^{O(\log n)}$ algorithm for dense models for some parameters, including some that are not covered by recent results of Barak et al.

20.6DCApr 26, 2018

Securing Distributed Gradient Descent in High Dimensional Statistical Learning

Lili Su, Jiaming Xu

We consider unreliable distributed learning systems wherein the training data is kept confidential by external workers, and the learner has to interact closely with those workers to train a model. In particular, we assume that there exists a system adversary that can adaptively compromise some workers; the compromised workers deviate from their local designed specifications by sending out arbitrarily malicious messages. We assume in each communication round, up to $q$ out of the $m$ workers suffer Byzantine faults. Each worker keeps a local sample of size $n$ and the total sample size is $N=nm$. We propose a secured variant of the gradient descent method that can tolerate up to a constant fraction of Byzantine workers, i.e., $q/m = O(1)$. Moreover, we show the statistical estimation error of the iterates converges in $O(\log N)$ rounds to $O(\sqrt{q/N} + \sqrt{d/N})$, where $d$ is the model dimension. As long as $q=O(d)$, our proposed algorithm achieves the optimal error rate $O(\sqrt{d/N})$. Our results are obtained under some technical assumptions. Specifically, we assume strongly-convex population risk. Nevertheless, the empirical risk (sample version) is allowed to be non-convex. The core of our method is to robustly aggregate the gradients computed by the workers based on the filtering procedure proposed by Steinhardt et al. On the technical front, deviating from the existing literature on robustly estimating a finite-dimensional mean vector, we establish a uniform concentration of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of model parameter, converges uniformly to the true gradient function. To get a near-optimal uniform concentration bound, we develop a new matrix concentration inequality, which might be of independent interest.

8.6DMApr 15, 2018

Hidden Hamiltonian Cycle Recovery via Linear Programming

Vivek Bagaria, Jian Ding, David Tse et al.

We introduce the problem of hidden Hamiltonian cycle recovery, where there is an unknown Hamiltonian cycle in an $n$-vertex complete graph that needs to be inferred from noisy edge measurements. The measurements are independent and distributed according to $\mathcal{P}_n$ for edges in the cycle and $\mathcal{Q}_n$ otherwise. This formulation is motivated by a problem in genome assembly, where the goal is to order a set of contigs (genome subsequences) according to their positions on the genome using long-range linking measurements between the contigs. Computing the maximum likelihood estimate in this model reduces to a Traveling Salesman Problem (TSP). Despite the NP-hardness of TSP, we show that a simple linear programming (LP) relaxation, namely the fractional $2$-factor (F2F) LP, recovers the hidden Hamiltonian cycle with high probability as $n \to \infty$ provided that $α_n - \log n \to \infty$, where $α_n \triangleq -2 \log \int \sqrt{d \mathcal{P}_n \, d \mathcal{Q}_n}$ is the Rényi divergence of order $\frac{1}{2}$. This condition is information-theoretically optimal in the sense that, under mild distributional assumptions, $α_n \geq (1+o(1)) \log n$ is necessary for any algorithm to succeed regardless of the computational cost. Departing from the usual proof techniques based on dual witness construction, the analysis relies on the combinatorial characterization (in particular, the half-integrality) of the extreme points of the F2F polytope. Represented as bicolored multi-graphs, these extreme points are further decomposed into simpler "blossom-type" structures for the large deviation analysis and counting arguments. Evaluation of the algorithm on real data shows improvements over existing approaches.

17.8MLSep 10, 2017

Rates of Convergence of Spectral Methods for Graphon Estimation

Jiaming Xu

This paper studies the problem of estimating the graphon model, the underlying generating mechanism of a network. Graphon estimation arises in many applications such as predicting missing links in networks and learning user preferences in recommender systems. The graphon model deals with a random graph of $n$ vertices such that each pair of vertices $i$ and $j$ is connected independently with probability $ρ\times f(x_i,x_j)$, where $x_i$ is the unknown $d$-dimensional label of vertex $i$, $f$ is an unknown symmetric function, and $ρ$ is a scaling parameter characterizing the graph sparsity. Recent studies have identified the minimax error rate of estimating the graphon from a single realization of the random graph. However, there exists a wide gap between the known error rates of computationally efficient estimation procedures and the minimax optimal error rate. Here we analyze a spectral method, namely universal singular value thresholding (USVT) algorithm, in the relatively sparse regime with the average vertex degree $nρ=Ω(\log n)$. When $f$ belongs to Hölder or Sobolev space with smoothness index $α$, we show the error rate of USVT is at most $(nρ)^{ -2 α/ (2α+d)}$, approaching the minimax optimal error rate $\log (nρ)/(nρ)$ for $d=1$ as $α$ increases. Furthermore, when $f$ is analytic, we show the error rate of USVT is at most $\log^d (nρ)/(nρ)$. In the special case of stochastic block model with $k$ blocks, the error rate of USVT is at most $k/(nρ)$, which is larger than the minimax optimal error rate by at most a multiplicative factor $k/\log k$. This coincides with the computational gap observed for community detection. A key step of our analysis is to derive the eigenvalue decaying rate of the edge probability matrix using piecewise polynomial approximations of the graphon function $f$.
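The thresholding step behind USVT can be sketched as follows. This is a generic illustration of singular value thresholding on an adjacency matrix, assuming the estimate is clipped to $[0,1]$ so that it can be read as an edge-probability matrix. The abstract does not state the threshold used in the analysis, so it is left as a parameter here; a value of order $\sqrt{nρ}$ is a common choice and is only an assumption.

```python
import numpy as np


def usvt(adj, threshold):
    """Estimate the edge-probability matrix by singular value thresholding.

    Keeps only singular values >= threshold, rebuilds the matrix,
    and clips entries to [0, 1].
    """
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    s_kept = np.where(s >= threshold, s, 0.0)
    estimate = (u * s_kept) @ vt  # scale columns of u, then recombine
    return np.clip(estimate, 0.0, 1.0)
```

A possible call is `usvt(adj, 2.0 * np.sqrt(n * adj.mean()))`, where `adj.mean()` stands in for the unknown sparsity level; the constant `2.0` is a placeholder, not a value taken from the paper.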

12.4MLApr 24, 2017

Learning from Comparisons and Choices

Sahand Negahban, Sewoong Oh, Kiran K. Thekumparampil et al.

When tracking user-specific online activities, each user's preference is revealed in the form of choices and comparisons. For example, a user's purchase history is a record of her choices, i.e. which item was chosen among a subset of offerings. A user's preferences can be observed either explicitly as in movie ratings or implicitly as in viewing times of news articles. Given such individualized ordinal data in the form of comparisons and choices, we address the problem of collaboratively learning representations of the users and the items. The learned features can be used to predict a user's preference of an unseen item to be used in recommendation systems. This also allows one to compute similarities among users and items to be used for categorization and search. Motivated by the empirical successes of the MultiNomial Logit (MNL) model in marketing and transportation, and also more recent successes in word embedding and crowdsourced image embedding, we pose this problem as learning the MNL model parameters that best explain the data. We propose a convex relaxation for learning the MNL model, and show that it is minimax optimal up to a logarithmic factor by comparing its performance to a fundamental lower bound. This characterizes the minimax sample complexity of the problem, and proves that the proposed estimator cannot be improved upon other than by a logarithmic factor. Further, the analysis identifies how the accuracy depends on the topology of sampling via the spectrum of the sampling graph. This provides a guideline for designing surveys when one can choose which items are to be compared. This is accompanied by numerical simulations on synthetic and real data sets, confirming our theoretical predictions.

10.8MLFeb 20, 2016

Semidefinite Programs for Exact Recovery of a Hidden Community

Bruce Hajek, Yihong Wu, Jiaming Xu

We study a semidefinite programming (SDP) relaxation of the maximum likelihood estimation for exactly recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i, j$ are both in the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$. We identify a sufficient condition and a necessary condition for the success of SDP for the general model. For both the Bernoulli case ($P={\rm Bern}(p)$ and $Q={\rm Bern}(q)$ with $p>q$) and the Gaussian case ($P=\mathcal{N}(μ,1)$ and $Q=\mathcal{N}(0,1)$ with $μ>0$), which correspond to the problem of planted dense subgraph recovery and submatrix localization respectively, the general results lead to the following findings: (1) If $K=ω( n /\log n)$, SDP attains the information-theoretic recovery limits with sharp constants; (2) If $K=Θ(n/\log n)$, SDP is order-wise optimal, but strictly suboptimal by a constant factor; (3) If $K=o(n/\log n)$ and $K \to \infty$, SDP is order-wise suboptimal. The same critical scaling for $K$ is found to hold, up to constant factors, for the performance of SDP on the stochastic block model of $n$ vertices partitioned into multiple communities of equal size $K$. A key ingredient in the proof of the necessary condition is a construction of a primal feasible solution based on random perturbation of the true cluster matrix.

18.3STDec 28, 2015

Convexified Modularity Maximization for Degree-corrected Stochastic Block Models

Yudong Chen, Xiaodong Li, Jiaming Xu

The stochastic block model (SBM) is a popular framework for studying community detection in networks. This model is limited by the assumption that all nodes in the same community are statistically equivalent and have equal expected degrees. The degree-corrected stochastic block model (DCSBM) is a natural extension of SBM that allows for degree heterogeneity within communities. This paper proposes a convexified modularity maximization approach for estimating the hidden communities under DCSBM. Our approach is based on a convex programming relaxation of the classical (generalized) modularity maximization formulation, followed by a novel doubly-weighted $\ell_1$-norm $k$-median procedure. We establish non-asymptotic theoretical guarantees for both approximate clustering and perfect clustering. Our approximate clustering results are insensitive to the minimum degree, and hold even in the sparse regime with bounded average degrees. In the special case of SBM, these theoretical results match the best-known performance guarantees of computationally feasible algorithms. Numerically, we provide an efficient implementation of our algorithm, which is applied to both synthetic and real-world networks. Experimental results show that our method enjoys competitive performance compared to the state of the art in the literature.

12.9MLOct 30, 2015

Submatrix localization via message passing

Bruce Hajek, Yihong Wu, Jiaming Xu

The principal submatrix localization problem deals with recovering a $K\times K$ principal submatrix of elevated mean $μ$ in a large $n\times n$ symmetric matrix subject to additive standard Gaussian noise. This problem serves as a prototypical example for community detection, in which the community corresponds to the support of the submatrix. The main result of this paper is that in the regime $Ω(\sqrt{n}) \leq K \leq o(n)$, the support of the submatrix can be weakly recovered (with $o(K)$ misclassification errors on average) by an optimized message passing algorithm if $λ= μ^2K^2/n$, the signal-to-noise ratio, exceeds $1/e$. This extends a result by Deshpande and Montanari previously obtained for $K=Θ(\sqrt{n}).$ In addition, the algorithm can be extended to provide exact recovery whenever information-theoretically possible and achieve the information limit of exact recovery as long as $K \geq \frac{n}{\log n} (\frac{1}{8e} + o(1))$. The total running time of the algorithm is $O(n^2\log n)$. Another version of the submatrix localization problem, known as noisy biclustering, aims to recover a $K_1\times K_2$ submatrix of elevated mean $μ$ in a large $n_1\times n_2$ Gaussian matrix. The optimized message passing algorithm and its analysis are adapted to the bicluster problem assuming $Ω(\sqrt{n_i}) \leq K_i \leq o(n_i)$ and $K_1\asymp K_2.$ A sharp information-theoretic condition for the weak recovery of both clusters is also identified.

11.8MLOct 9, 2015

Recovering a Hidden Community Beyond the Kesten-Stigum Threshold in $O(|E| \log^*|V|)$ Time

Bruce Hajek, Yihong Wu, Jiaming Xu

Community detection is considered for a stochastic block model graph of $n$ vertices, with $K$ vertices in the planted community, edge probability $p$ for pairs of vertices both in the community, and edge probability $q$ for other pairs of vertices. The main focus of the paper is on weak recovery of the community based on the graph $G$, with $o(K)$ misclassified vertices on average, in the sublinear regime $n^{1-o(1)} \leq K \leq o(n).$ A critical parameter is the effective signal-to-noise ratio $λ=K^2(p-q)^2/((n-K)q)$, with $λ=1$ corresponding to the Kesten-Stigum threshold. We show that a belief propagation algorithm achieves weak recovery if $λ>1/e$, beyond the Kesten-Stigum threshold by a factor of $1/e.$ The belief propagation algorithm only needs to run for $\log^\ast n+O(1)$ iterations, with the total time complexity $O(|E| \log^*n)$, where $\log^*n$ is the iterated logarithm of $n.$ Conversely, if $λ\leq 1/e$, no local algorithm can asymptotically outperform trivial random guessing. Furthermore, a linear message-passing algorithm that corresponds to applying power iteration to the non-backtracking matrix of the graph is shown to attain weak recovery if and only if $λ>1$. In addition, the belief propagation algorithm can be combined with a linear-time voting procedure to achieve the information limit of exact recovery (correctly classify all vertices with high probability) for all $K \ge \frac{n}{\log n} \left( ρ_{\rm BP} +o(1) \right),$ where $ρ_{\rm BP}$ is a function of $p/q$.
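As a small worked example of the thresholds above, the helper below evaluates the effective signal-to-noise ratio $λ=K^2(p-q)^2/((n-K)q)$ and reports which of the three regimes named in the abstract it falls into. The results are asymptotic, so applying them to finite parameters is only a heuristic; the function names and the sample parameters are illustrative assumptions.

```python
import math


def effective_snr(n, K, p, q):
    """lambda = K^2 (p - q)^2 / ((n - K) q), as defined in the abstract."""
    return K ** 2 * (p - q) ** 2 / ((n - K) * q)


def recovery_regime(lam):
    """Map lambda to the regime stated in the abstract (asymptotic claims)."""
    if lam > 1.0:
        return "above Kesten-Stigum: linear message passing and BP succeed"
    if lam > 1.0 / math.e:
        return "between 1/e and 1: BP succeeds, linear message passing fails"
    return "at or below 1/e: no local algorithm beats random guessing"


# Hypothetical parameters chosen only to land between the two thresholds.
lam = effective_snr(n=1_000_000, K=10_000, p=0.004, q=0.001)
print(lam, recovery_regime(lam))
```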

18.7MLSep 25, 2015

Information Limits for Recovering a Hidden Community

Bruce Hajek, Yihong Wu, Jiaming Xu

We study the problem of recovering a hidden community of cardinality $K$ from an $n \times n$ symmetric data matrix $A$, where for distinct indices $i,j$, $A_{ij} \sim P$ if $i, j$ both belong to the community and $A_{ij} \sim Q$ otherwise, for two known probability distributions $P$ and $Q$ depending on $n$. If $P={\rm Bern}(p)$ and $Q={\rm Bern}(q)$ with $p>q$, it reduces to the problem of finding a densely-connected $K$-subgraph planted in a large Erdős-Rényi graph; if $P=\mathcal{N}(μ,1)$ and $Q=\mathcal{N}(0,1)$ with $μ>0$, it corresponds to the problem of locating a $K \times K$ principal submatrix of elevated means in a large Gaussian random matrix. We focus on two types of asymptotic recovery guarantees as $n \to \infty$: (1) weak recovery: expected number of classification errors is $o(K)$; (2) exact recovery: probability of classifying all indices correctly converges to one. Under mild assumptions on $P$ and $Q$, and allowing the community size to scale sublinearly with $n$, we derive a set of sufficient conditions and a set of necessary conditions for recovery, which are asymptotically tight with sharp constants. The results hold in particular for the Gaussian case, and for the case of bounded log likelihood ratio, including the Bernoulli case whenever $\frac{p}{q}$ and $\frac{1-p}{1-q}$ are bounded away from zero and infinity. An important algorithmic implication is that, whenever exact recovery is information theoretically possible, any algorithm that provides weak recovery when the community size is concentrated near $K$ can be upgraded to achieve exact recovery in linear additional time by a simple voting procedure.

10.0MLSep 10, 2015

Density Evolution in the Degree-correlated Stochastic Block Model

Elchanan Mossel, Jiaming Xu

There is a recent surge of interest in identifying the sharp recovery thresholds for cluster recovery under the stochastic block model. In this paper, we address the more refined question of how many vertices will be misclassified on average. We consider the binary form of the stochastic block model, where $n$ vertices are partitioned into two clusters with edge probability $a/n$ within the first cluster, $c/n$ within the second cluster, and $b/n$ across clusters. Suppose that as $n \to \infty$, $a= b+ μ\sqrt{ b} $, $c=b+ ν\sqrt{ b} $ for two fixed constants $μ, ν$, and $b \to \infty$ with $b=n^{o(1)}$. When the cluster sizes are balanced and $μ\neq ν$, we show that the minimum fraction of misclassified vertices on average is given by $Q(\sqrt{v^*})$, where $Q(x)$ is the Q-function for standard normal, $v^*$ is the unique fixed point of $v= \frac{(μ-ν)^2}{16} + \frac{ (μ+ν)^2 }{16} \mathbb{E}[ \tanh(v+ \sqrt{v} Z)],$ and $Z$ is standard normal. Moreover, the minimum misclassified fraction on average is attained by a local algorithm, namely belief propagation, in time linear in the number of edges. Our proof techniques are based on connecting the cluster recovery problem to tree reconstruction problems, and analyzing the density evolution of belief propagation on trees with Gaussian approximations.
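The fixed-point characterization above lends itself to a short numerical sketch. The code below evaluates $\mathbb{E}[\tanh(v+\sqrt{v}Z)]$ by Gauss-Hermite quadrature and iterates the map $v \mapsto \frac{(μ-ν)^2}{16} + \frac{(μ+ν)^2}{16}\mathbb{E}[\tanh(v+\sqrt{v}Z)]$ starting from $v=0$, then returns $Q(\sqrt{v^*})$. The quadrature order, the starting point, the tolerance, and the function names are choices made for this illustration, not details from the paper.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for integrals against exp(-x^2 / 2).
_NODES, _WEIGHTS = hermegauss(80)


def expected_tanh(v):
    """E[tanh(v + sqrt(v) Z)] for standard normal Z, via quadrature."""
    values = np.tanh(v + math.sqrt(v) * _NODES)
    return float(_WEIGHTS @ values) / math.sqrt(2.0 * math.pi)


def fixed_point(mu, nu, tol=1e-12, max_iter=10_000):
    """Iterate v <- (mu-nu)^2/16 + (mu+nu)^2/16 * E[tanh(v + sqrt(v) Z)]."""
    c1 = (mu - nu) ** 2 / 16.0
    c2 = (mu + nu) ** 2 / 16.0
    v = 0.0
    for _ in range(max_iter):
        v_next = c1 + c2 * expected_tanh(v)
        if abs(v_next - v) < tol:
            return v_next
        v = v_next
    raise RuntimeError("fixed-point iteration did not converge")


def q_function(x):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def min_misclassified_fraction(mu, nu):
    """Q(sqrt(v*)), the limiting misclassified fraction from the abstract."""
    return q_function(math.sqrt(fixed_point(mu, nu)))
```

Calling `min_misclassified_fraction(2.0, -1.0)` gives the predicted fraction for one choice of $μ \neq ν$; the formula is stated for balanced clusters in the asymptotic regime of the abstract.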

16.3MLAug 10, 2015

Local Algorithms for Block Models with Side Information

Elchanan Mossel, Jiaming Xu

There has been recent interest in understanding the power of local algorithms for optimization and inference problems on sparse graphs. Gamarnik and Sudan (2014) showed that local algorithms are weaker than global algorithms for finding large independent sets in sparse random regular graphs. Montanari (2015) showed that local algorithms are suboptimal for finding a community with high connectivity in sparse Erdős-Rényi random graphs. For the symmetric planted partition problem (also named community detection for the block models) on sparse graphs, a simple observation is that local algorithms cannot have non-trivial performance. In this work we consider the effect of side information on local algorithms for community detection under the binary symmetric stochastic block model. In the block model with side information each of the $n$ vertices is labeled $+$ or $-$ independently and uniformly at random; each pair of vertices is connected independently with probability $a/n$ if both of them have the same label or $b/n$ otherwise. The goal is to estimate the underlying vertex labeling given 1) the graph structure and 2) side information in the form of a vertex labeling positively correlated with the true one. Assuming that the ratio between in and out degree $a/b$ is $Θ(1)$ and the average degree $ (a+b) / 2 = n^{o(1)}$, we characterize three different regimes under which a local algorithm, namely, belief propagation run on the local neighborhoods, maximizes the expected fraction of vertices labeled correctly. Thus, in contrast to the case of symmetric block models without side information, we show that local algorithms can achieve optimal performance for the block model with side information.

9.5LGJun 26, 2015

Collaboratively Learning Preferences from Ordinal Data

Sewoong Oh, Kiran K. Thekumparampil, Jiaming Xu

In applications such as recommendation systems and revenue management, it is important to predict preferences on items that have not been seen by a user or predict outcomes of comparisons among those that have never been compared. The multinomial logit model, a popular discrete choice model, captures the structure of the hidden preferences with a low-rank matrix. In order to predict the preferences, we want to learn the underlying model from noisy observations of the low-rank matrix, collected as revealed preferences in various forms of ordinal data. A natural approach to learn such a model is to solve a convex relaxation of nuclear norm minimization. We present the convex relaxation approach in two contexts of interest: collaborative ranking and bundled choice modeling. In both cases, we show that the convex relaxation is minimax optimal. We prove an upper bound on the resulting error with finite samples, and provide a matching information-theoretic lower bound.

21.8MLFeb 26, 2015

Achieving Exact Cluster Recovery Threshold via Semidefinite Programming: Extensions

Bruce Hajek, Yihong Wu, Jiaming Xu

Resolving a conjecture of Abbe, Bandeira and Hall, the authors have recently shown that the semidefinite programming (SDP) relaxation of the maximum likelihood estimator achieves the sharp threshold for exactly recovering the community structure under the binary stochastic block model of two equal-sized clusters. The same was shown for the case of a single cluster and outliers. Extending the proof techniques, in this paper it is shown that SDP relaxations also achieve the sharp recovery threshold in the following cases: (1) Binary stochastic block model with two clusters of sizes proportional to network size but not necessarily equal; (2) Stochastic block model with a fixed number of equal-sized clusters; (3) Binary censored block model with the background graph being Erdős-Rényi. Furthermore, a sufficient condition is given for an SDP procedure to achieve exact recovery for the general case of a fixed number of clusters plus outliers. These results demonstrate the versatility of SDP relaxation as a simple, general purpose, computationally feasible methodology for community detection.

10.0MLFeb 16, 2015

Clustering and Inference From Pairwise Comparisons

Rui Wu, Jiaming Xu, R. Srikant et al.

Given a set of pairwise comparisons, the classical ranking problem computes a single ranking that best represents the preferences of all users. In this paper, we study the problem of inferring individual preferences, arising in the context of making personalized recommendations. In particular, we assume that there are $n$ users of $r$ types; users of the same type provide similar pairwise comparisons for $m$ items according to the Bradley-Terry model. We propose an efficient algorithm that accurately estimates the individual preferences for almost all users, if there are $r \max \{m, n\}\log m \log^2 n$ pairwise comparisons per type, which is near optimal in sample complexity when $r$ only grows logarithmically with $m$ or $n$. Our algorithm has three steps: first, for each user, compute the net-win vector which is a projection of its $\binom{m}{2}$-dimensional vector of pairwise comparisons onto an $m$-dimensional linear subspace; second, cluster the users based on the net-win vectors; third, estimate a single preference for each cluster separately. The net-win vectors are much less noisy than the high dimensional vectors of pairwise comparisons and clustering is more accurate after the projection as confirmed by numerical experiments. Moreover, we show that, when a cluster is only approximately correct, the maximum likelihood estimation for the Bradley-Terry model is still close to the true preference.
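The first step of the algorithm can be illustrated with a short sketch. It assumes one natural reading of the net-win vector, namely that entry $i$ counts the comparisons item $i$ won minus those it lost for a given user; the abstract only describes it as a projection onto an $m$-dimensional subspace, so the exact definition and any scaling used in the paper may differ. The function name and input format are hypothetical.

```python
import numpy as np


def net_win_vector(comparisons, m):
    """Wins minus losses per item for one user.

    `comparisons` is an iterable of (winner, loser) item indices in
    range(m); this input format is an assumption for illustration.
    """
    v = np.zeros(m)
    for winner, loser in comparisons:
        v[winner] += 1.0
        v[loser] -= 1.0
    return v
```

Stacking these vectors for all users gives an $n \times m$ matrix on which any standard clustering routine could be run for the second step.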

24.9MLNov 24, 2014

Achieving Exact Cluster Recovery Threshold via Semidefinite Programming

Bruce Hajek, Yihong Wu, Jiaming Xu

The binary symmetric stochastic block model deals with a random graph of $n$ vertices partitioned into two equal-sized clusters, such that each pair of vertices is connected independently with probability $p$ within clusters and $q$ across clusters. In the asymptotic regime of $p=a \log n/n$ and $q=b \log n/n$ for fixed $a,b$ and $n \to \infty$, we show that the semidefinite programming relaxation of the maximum likelihood estimator achieves the optimal threshold for exactly recovering the partition from the graph with probability tending to one, resolving a conjecture of Abbe et al. \cite{Abbe14}. Furthermore, we show that the semidefinite programming relaxation also achieves the optimal recovery threshold in the planted dense subgraph model containing a single cluster of size proportional to $n$.

9.2STJun 26, 2014

Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results

Jiaming Xu, Laurent Massoulié, Marc Lelarge

The classical setting of community detection consists of networks exhibiting a clustered structure. To more accurately model real systems we consider a class of networks (i) whose edges may carry labels and (ii) which may lack a clustered structure. Specifically we assume that nodes possess latent attributes drawn from a general compact space and edges between two nodes are randomly generated and labeled according to some unknown distribution as a function of their latent attributes. Our goal is then to infer the edge label distributions from a partially observed network. We propose a computationally efficient spectral algorithm and show it allows for asymptotically correct inference when the average node degree could be as low as logarithmic in the total number of nodes. Conversely, if the average node degree is below a specific constant threshold, we show that no algorithm can achieve better inference than guessing without using the observations. As a byproduct of our analysis, we show that our model provides a general procedure to construct random graph models with a spectrum asymptotic to a pre-specified eigenvalue distribution such as a power-law distribution.

21.2STJun 25, 2014

Computational Lower Bounds for Community Detection on Random Graphs

Bruce Hajek, Yihong Wu, Jiaming Xu

This paper studies the problem of detecting the presence of a small dense community planted in a large Erdős-Rényi random graph $\mathcal{G}(N,q)$, where the edge probability within the community exceeds $q$ by a constant factor. Assuming the hardness of the planted clique detection problem, we show that the computational complexity of detecting the community exhibits the following phase transition phenomenon: As the graph size $N$ grows and the graph becomes sparser according to $q=N^{-α}$, there exists a critical value of $α= \frac{2}{3}$, below which there exists a computationally intensive procedure that can detect far smaller communities than any computationally efficient procedure, and above which a linear-time procedure is statistically optimal. The results also lead to the average-case hardness results for recovering the dense community and approximating the densest $K$-subgraph.

22.9MLJun 21, 2014

Minimax-optimal Inference from Partial Rankings

Bruce Hajek, Sewoong Oh, Jiaming Xu

This paper studies the problem of inferring a global preference based on the partial rankings provided by many users over different subsets of items according to the Plackett-Luce model. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. For a given assignment of items to users, we first derive an oracle lower bound of the estimation error that holds even for the more general Thurstone models. Then we show that the Cramér-Rao lower bound and our upper bounds inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. When the system is allowed to choose the item assignment, we propose a random assignment scheme. Our oracle lower bound and upper bounds imply that it is minimax-optimal up to a logarithmic factor among all assignment schemes and the lower bound can be achieved by the maximum likelihood estimator as well as popular rank-breaking schemes that decompose partial rankings into pairwise comparisons. The numerical experiments corroborate our theoretical findings.
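Rank-breaking, mentioned above, turns a partial ranking into pairwise comparisons. The sketch below shows the simplest variant, in which every ordered pair implied by the ranking is emitted; the abstract does not say which breaking schemes are analyzed, so treating this "full" breaking as representative is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def full_rank_breaking(ranking):
    """Decompose a ranking into all implied (winner, loser) pairs.

    `ranking` lists items from most preferred to least preferred.
    """
    return list(combinations(ranking, 2))


# A user who ranks three items yields three pairwise comparisons.
print(full_rank_breaking(["a", "b", "c"]))
```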

28.5MLFeb 6, 2014

Statistical-Computational Tradeoffs in Planted Problems and Submatrix Localization with a Growing Number of Clusters and Submatrices

Yudong Chen, Jiaming Xu

We consider two closely related problems: planted clustering and submatrix localization. The planted clustering problem assumes that a random graph is generated based on some underlying clusters of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and stochastic block model, which are widely used for studying community detection and clustering/bi-clustering. For both problems, we show that the space of the model parameters (cluster/submatrix size, cluster density, and submatrix mean) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the \emph{impossible} regime, where all algorithms fail; (2) the \emph{hard} regime, where the computationally expensive Maximum Likelihood Estimator (MLE) succeeds; (3) the \emph{easy} regime, where the polynomial-time convexified MLE succeeds; (4) the \emph{simple} regime, where a simple counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the previous harder regimes. Our theorems establish the minimax recovery limits, which are tight up to constants and hold with a growing number of clusters/submatrices, and provide a stronger performance guarantee than previously known for polynomial-time algorithms. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax recovery limit may not be achievable by polynomial-time algorithms.
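To make the "simple counting/thresholding procedure" concrete, here is a minimal sketch for the special case of a single planted dense cluster: count each node's degree and keep the nodes above a cutoff. The function names and the cutoff value are assumptions chosen for illustration; the paper's actual procedure and threshold are not given in the abstract.

```python
def degrees(edges, n):
    """Count the degree of each of the n nodes from an undirected edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def threshold_by_degree(edges, n, threshold):
    """Return the nodes whose degree exceeds the threshold.

    Hypothetical sketch: nodes inside a dense planted cluster tend to have
    higher degree, so a degree cutoff can separate them when the signal is
    strong enough.
    """
    deg = degrees(edges, n)
    return {v for v in range(n) if deg[v] > threshold}

# Toy graph: a 4-clique on nodes 0..3 attached to a sparse path 3-4-5-6.
clique = [(i, j) for i in range(4) for j in range(i + 1, 4)]
edges = clique + [(3, 4), (4, 5), (5, 6)]
recovered = threshold_by_degree(edges, 7, threshold=2.5)
```

On this toy graph the cutoff of 2.5 separates the clique nodes (degree 3 or 4) from the path nodes (degree at most 2).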

6.7MLOct 1, 2013

Jointly Clustering Rows and Columns of Binary Matrices: Algorithms and Trade-offs

Jiaming Xu, Rui Wu, Kai Zhu et al.

In standard clustering problems, data points are represented by vectors, and by stacking them together, one forms a data matrix with row or column cluster structure. In this paper, we consider a class of binary matrices, arising in many applications, which exhibit both row and column cluster structure, and our goal is to exactly recover the underlying row and column clusters by observing only a small fraction of noisy entries. We first derive a lower bound on the minimum number of observations needed for exact cluster recovery. Then, we propose three algorithms with different running time and compare the number of observations needed by them for successful cluster recovery. Our analytical results show smooth time-data trade-offs: one can gradually reduce the computational complexity when increasingly more observations are available.