Vivek Bagaria

h-index 10

6 papers

128 citations

Novelty 71%

AI Score 36

Ranked #98,967 of 194,257 authors (top 51%); #1,398 in ML (top 41%)

6 Papers

11.1 | ML | Nov 2, 2017 | Code

Medoids in almost linear time via multi-armed bandits

Vivek Bagaria, Govinda M. Kamath, Vasilis Ntranos et al.

Computing the medoid of a large number of points in high-dimensional space is an increasingly common operation in many data science problems. We present an algorithm Med-dit which uses O(n log n) distance evaluations to compute the medoid with high probability. Med-dit is based on a connection with the multi-armed bandit problem. We evaluate the performance of Med-dit empirically on the Netflix-prize and the single-cell RNA-Seq datasets, containing hundreds of thousands of points living in tens of thousands of dimensions, and observe a 5-10x improvement in performance over the current state of the art. Med-dit is available at https://github.com/bagavi/Meddit
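The bandit connection described in this abstract can be sketched roughly as follows. This is a simplified UCB-style illustration and not the released Med-dit code: each point is treated as an arm, a "pull" measures the distance to one randomly chosen reference point, and an arm that has already been sampled `n` times is evaluated exactly. The function names, the `sigma` scale parameter, and the fixed `delta` (with no union bound over arms or rounds) are assumptions made for this sketch.

```python
import math
import random


def exact_medoid(points, dist):
    """Brute-force baseline: n^2 distance evaluations."""
    n = len(points)
    totals = [sum(dist(points[i], p) for p in points) for i in range(n)]
    return min(range(n), key=totals.__getitem__)


def bandit_medoid(points, dist, sigma=1.0, delta=0.01, seed=0):
    """UCB-style medoid search (illustrative sketch only).

    sigma: assumed sub-Gaussian scale of one sampled distance (hypothetical knob).
    delta: error level used in each confidence radius (hypothetical knob).
    """
    rng = random.Random(seed)
    n = len(points)
    sums = [0.0] * n      # running sum of sampled distances per arm
    pulls = [0] * n       # number of samples behind each running sum
    exact = [False] * n   # True once an arm's mean distance is known exactly

    def pull(i):
        # One distance evaluation against a uniformly random reference point.
        j = rng.randrange(n)
        sums[i] += dist(points[i], points[j])
        pulls[i] += 1

    def radius(i):
        # Confidence radius; zero once the arm has been computed exactly.
        if exact[i]:
            return 0.0
        return sigma * math.sqrt(2.0 * math.log(1.0 / delta) / pulls[i])

    for i in range(n):
        pull(i)

    while True:
        lcb = [sums[i] / pulls[i] - radius(i) for i in range(n)]
        best = min(range(n), key=lcb.__getitem__)
        ucb_best = sums[best] / pulls[best] + radius(best)
        # Stop when the most promising arm is separated from all others.
        if all(ucb_best <= lcb[k] for k in range(n) if k != best):
            return best
        if pulls[best] >= n:
            # Sampling further costs more than one exact pass over the data.
            sums[best] = sum(dist(points[best], p) for p in points)
            pulls[best] = n
            exact[best] = True
        else:
            pull(best)
```

Arms whose confidence interval clearly excludes them stop receiving samples, so most of the distance evaluations are spent on the few points that are plausible medoids. A larger `sigma` makes the search more conservative, at the cost of more evaluations.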

8.6 | DC | Sep 25, 2019 | Code

Practical Low Latency Proof of Work Consensus

Lei Yang, Xuechao Wang, Vivek Bagaria et al.

Bitcoin is the first fully-decentralized permissionless blockchain protocol to achieve a high level of security, but at the expense of poor throughput and latency. Scaling the performance of Bitcoin has been a major recent direction of research. One successful direction of work has involved replacing proof of work (PoW) by proof of stake (PoS). Proposals to scale the performance in the PoW setting itself have focused mostly on parallelizing the mining process, scaling throughput; the few proposals to improve latency have either sacrificed throughput or offered latency guarantees that involve large constants, rendering them practically useless. Our first contribution is to design a new PoW blockchain Prism++ that has provably low latency and high throughput; the design retains the parallel-chain approach espoused in Prism but invents a new confirmation rule to infer the permanency of a block by combining information across the parallel chains. We show security at the level of Bitcoin with very small confirmation latency (a small constant factor of block interarrival time). A key aspect to scaling the performance is to use a large number of parallel chains, which puts significant strain on the system. Our second contribution is the design and evaluation of a practical system to efficiently manage the memory, computation, and I/O imperatives of a large number of parallel chains. Our implementation of Prism++ achieves a throughput of over 80,000 transactions per second and confirmation latency of tens of seconds on networks of up to 900 EC2 Virtual Machines.

17.6 | CR | Oct 18, 2018

Deconstructing the Blockchain to Approach Physical Limits

Vivek Bagaria, Sreeram Kannan, David Tse et al.

Transaction throughput, confirmation latency and confirmation reliability are fundamental performance measures of any blockchain system in addition to its security. In a decentralized setting, these measures are limited by two underlying physical network attributes: communication capacity and speed-of-light propagation delay. Existing systems operate far away from these physical limits. In this work we introduce Prism, a new proof-of-work blockchain protocol, which can achieve 1) security against up to 50% adversarial hashing power; 2) optimal throughput up to the capacity C of the network; 3) confirmation latency for honest transactions proportional to the propagation delay D, with confirmation error probability exponentially small in CD; 4) eventual total ordering of all transactions. Our approach to the design of this protocol is based on deconstructing the blockchain into its basic functionalities and systematically scaling up these functionalities to approach their physical limits.

10.1 | LG | May 21, 2018 | Code

Bandit-Based Monte Carlo Optimization for Nearest Neighbors

Vivek Bagaria, Tavor Z. Baharav, Govinda M. Kamath et al.

The celebrated Monte Carlo method estimates an expensive-to-compute quantity by random sampling. Bandit-based Monte Carlo optimization is a general technique for computing the minimum of many such expensive-to-compute quantities by adaptive random sampling. The technique converts an optimization problem into a statistical estimation problem which is then solved via multi-armed bandits. We apply this technique to solve the problem of high-dimensional $k$-nearest neighbors, developing an algorithm which we prove is able to identify exact nearest neighbors with high probability. We show that under regularity assumptions on a dataset of $n$ points in $d$-dimensional space, the complexity of our algorithm scales logarithmically with the dimension of the data as $O\left((n+d)\log^2 \left(\frac{nd}{\delta}\right)\right)$ for error probability $\delta$, rather than linearly as in exact computation requiring $O(nd)$. We corroborate our theoretical results with numerical simulations, showing that our algorithm outperforms both exact computation and state-of-the-art algorithms such as kGraph, NGT, and LSH on real datasets.

8.6 | DM | Apr 15, 2018

Hidden Hamiltonian Cycle Recovery via Linear Programming

Vivek Bagaria, Jian Ding, David Tse et al.

We introduce the problem of hidden Hamiltonian cycle recovery, where there is an unknown Hamiltonian cycle in an $n$-vertex complete graph that needs to be inferred from noisy edge measurements. The measurements are independent and distributed according to $\mathcal{P}_n$ for edges in the cycle and $\mathcal{Q}_n$ otherwise. This formulation is motivated by a problem in genome assembly, where the goal is to order a set of contigs (genome subsequences) according to their positions on the genome using long-range linking measurements between the contigs. Computing the maximum likelihood estimate in this model reduces to a Traveling Salesman Problem (TSP). Despite the NP-hardness of TSP, we show that a simple linear programming (LP) relaxation, namely the fractional $2$-factor (F2F) LP, recovers the hidden Hamiltonian cycle with high probability as $n \to \infty$ provided that $\alpha_n - \log n \to \infty$, where $\alpha_n \triangleq -2 \log \int \sqrt{d\mathcal{P}_n \, d\mathcal{Q}_n}$ is the Rényi divergence of order $\frac{1}{2}$. This condition is information-theoretically optimal in the sense that, under mild distributional assumptions, $\alpha_n \geq (1+o(1)) \log n$ is necessary for any algorithm to succeed regardless of the computational cost. Departing from the usual proof techniques based on dual witness construction, the analysis relies on the combinatorial characterization (in particular, the half-integrality) of the extreme points of the F2F polytope. Represented as bicolored multi-graphs, these extreme points are further decomposed into simpler "blossom-type" structures for the large deviation analysis and counting arguments. Evaluation of the algorithm on real data shows improvements over existing approaches.
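To make the recovery condition concrete, here is a worked instance under a Gaussian model. The model is an assumption chosen for illustration and is not stated in the abstract: measurements on cycle edges are drawn from $N(\mu_n, 1)$ and all other measurements from $N(0, 1)$.

```latex
% Assumed model (illustrative): cycle edges ~ N(\mu_n, 1), other edges ~ N(0, 1).
\begin{align*}
\int \sqrt{p(x)\,q(x)}\,dx
  &= \int \frac{1}{\sqrt{2\pi}}
     \exp\!\left(-\frac{(x-\mu_n)^2 + x^2}{4}\right) dx \\
  &= \exp\!\left(-\frac{\mu_n^2}{8}\right)
     \int \frac{1}{\sqrt{2\pi}}
     \exp\!\left(-\frac{(x-\mu_n/2)^2}{2}\right) dx
   = \exp\!\left(-\frac{\mu_n^2}{8}\right), \\
\alpha_n
  &= -2 \log \exp\!\left(-\frac{\mu_n^2}{8}\right)
   = \frac{\mu_n^2}{4}.
\end{align*}
```

Under this assumed model, the sufficient condition $\alpha_n - \log n \to \infty$ reads $\mu_n^2/4 - \log n \to \infty$, so the signal strength $\mu_n^2$ needs to exceed $4 \log n$ by a diverging amount.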

15.0 | ML | Sep 20, 2017 | Code

Contrastive Principal Component Analysis

Abubakar Abid, Martin J. Zhang, Vivek K. Bagaria et al.

We present a new technique called contrastive principal component analysis (cPCA) that is designed to discover low-dimensional structure that is unique to a dataset, or enriched in one dataset relative to other data. The technique is a generalization of standard PCA, for the setting where multiple datasets are available -- e.g. a treatment and a control group, or a mixed versus a homogeneous population -- and the goal is to explore patterns that are specific to one of the datasets. We conduct a wide variety of experiments in which cPCA identifies important dataset-specific patterns that are missed by PCA, demonstrating that it is useful for many applications: subgroup discovery, visualizing trends, feature selection, denoising, and data-dependent standardization. We provide geometrical interpretations of cPCA and show that it satisfies desirable theoretical guarantees. We also extend cPCA to nonlinear settings in the form of kernel cPCA. We have released our code as a Python package and documentation is on GitHub.
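The abstract does not spell out the computation. A minimal sketch of one common formulation, assumed here, takes the top eigenvectors of the target covariance minus `alpha` times the background covariance. The function name `contrastive_directions`, the `alpha` default, and the synthetic data are hypothetical and are not taken from the authors' released package.

```python
import numpy as np


def contrastive_directions(target, background, alpha=1.0, n_components=2):
    """Top eigenvectors of cov(target) - alpha * cov(background).

    Illustrative sketch only. alpha = 0 reduces to ordinary PCA on the target.
    """
    t = target - target.mean(axis=0)
    b = background - background.mean(axis=0)
    cov_t = t.T @ t / (len(t) - 1)
    cov_b = b.T @ b / (len(b) - 1)
    # The contrast matrix is symmetric but not necessarily positive
    # semidefinite, so use eigh and sort eigenvalues in descending order.
    evals, evecs = np.linalg.eigh(cov_t - alpha * cov_b)
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order]


# Synthetic example: both datasets share a high-variance nuisance axis (0);
# only the target has extra variance along axis 1.
rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 3)) * np.array([5.0, 1.0, 1.0])
target = rng.normal(size=(2000, 3)) * np.array([5.0, 3.0, 1.0])

pca_dir = contrastive_directions(target, background, alpha=0.0, n_components=1)[:, 0]
cpca_dir = contrastive_directions(target, background, alpha=1.0, n_components=1)[:, 0]
print("PCA direction: ", np.round(pca_dir, 2))
print("cPCA direction:", np.round(cpca_dir, 2))
```

In this synthetic setup, ordinary PCA is dominated by the shared nuisance axis, while subtracting the background covariance shifts the leading direction toward the axis where only the target has extra variance. The choice of `alpha` controls how strongly background variance is penalized.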