Alex Olshevsky

LG
h-index: 41
48 papers
575 citations
Novelty: 50%
AI Score: 49

48 Papers

OC Mar 16, 2014
Distributed optimization over time-varying directed graphs

Angelia Nedic, Alex Olshevsky

We consider distributed optimization by a collection of nodes, each having access to its own convex function, whose collective goal is to minimize the sum of the functions. The communications between nodes are described by a time-varying sequence of directed graphs, which is uniformly strongly connected. For such communications, assuming that every node knows its out-degree, we develop a broadcast-based algorithm, termed the subgradient-push, which steers every node to an optimal value under a standard assumption of subgradient boundedness. The subgradient-push requires no knowledge of either the number of agents or the graph sequence to implement. Our analysis shows that the subgradient-push algorithm converges at a rate of $O(\ln(t)/\sqrt{t})$, where the constant depends on the initial values at the nodes, the subgradient norms, and, more interestingly, on both the consensus speed and the imbalances of influence among the nodes.
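A minimal sketch of a push-sum style subgradient iteration of the kind described above, on a toy scalar problem. The abstract does not spell out the update, so the recursion below (broadcast `x/d` and `y/d` to out-neighbors, form the ratio `z = w/y`, then take a subgradient step with step size $1/\sqrt{t}$) is an assumption, as are the local functions $f_i(z) = |z - a_i|$, the alternating graph sequence, and every name in the code.

```python
import math

def sign(v):
    """A subgradient of |v| (0 is a valid choice at the kink)."""
    return (v > 0) - (v < 0)

def out_neighbors(t, n):
    """Hypothetical time-varying digraph: a directed ring with self-loops;
    on odd steps node 0 additionally broadcasts to node 2."""
    nbrs = [[j, (j + 1) % n] for j in range(n)]
    if t % 2 == 1:
        nbrs[0].append(2)
    return nbrs

def subgradient_push(a, steps):
    """Each node i holds f_i(z) = |z - a[i]|; the sum is minimized at the median of a."""
    n = len(a)
    x = [float(v) for v in a]   # assumed initialization x_i(0) = a_i
    y = [1.0] * n               # push-sum weights
    z = x[:]
    for t in range(1, steps + 1):
        nbrs = out_neighbors(t, n)
        w = [0.0] * n
        y_new = [0.0] * n
        for j in range(n):
            d = len(nbrs[j])    # out-degree of j, counting its self-loop
            for i in nbrs[j]:
                w[i] += x[j] / d
                y_new[i] += y[j] / d
        y = y_new
        z = [w[i] / y[i] for i in range(n)]      # de-biased local estimates
        alpha = 1.0 / math.sqrt(t)               # diminishing step size
        x = [w[i] - alpha * sign(z[i] - a[i]) for i in range(n)]
    return z

a = [1.0, 2.0, 3.0, 4.0, 10.0]
z = subgradient_push(a, 20000)
print(z)
```

Each node only needs its own out-degree to form its broadcast, which is the information requirement named in the abstract. In this toy instance every entry of `z` should end up near the median of `a`.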

OC Jun 25, 2011
Distributed anonymous discrete function computation

Julien M. Hendrickx, Alex Olshevsky, John N. Tsitsiklis

We propose a model for deterministic distributed function computation by a network of identical and anonymous nodes. In this model, each node has bounded computation and storage capabilities that do not grow with the network size. Furthermore, each node only knows its neighbors, not the entire graph. Our goal is to characterize the class of functions that can be computed within this model. In our main result, we provide a necessary condition for computability which we show to be nearly sufficient, in the sense that every function that satisfies this condition can at least be approximated. The problem of computing suitably rounded averages in a distributed manner plays a central role in our development; we provide an algorithm that solves it in time that grows quadratically with the size of the network.

OC Nov 8, 2012
Degree Fluctuations and the Convergence Time of Consensus Algorithms

Alex Olshevsky, John Tsitsiklis

We consider a consensus algorithm in which every node in a sequence of undirected, B-connected graphs assigns equal weight to each of its neighbors. Under the assumption that the degree of each node is fixed (except for times when the node has no connections to other nodes), we show that consensus is achieved within a given accuracy $ε$ on n nodes in time $B+4n^3 B \ln(2n/ε)$. Because there is a direct relation between consensus algorithms in time-varying environments and inhomogeneous random walks, our result also translates into a general statement on such random walks. Moreover, we give a simple proof of a result of Cao, Spielman, and Morse that the worst case convergence time becomes exponentially large in the number of nodes $n$ under slight relaxation of the degree constancy assumption.
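A small sketch of an equal-neighbor averaging rule, run for the number of steps given by the bound $B+4n^3 B \ln(2n/ε)$. It assumes a fixed path graph (so $B = 1$), assumes each node includes its own value in the average, and measures accuracy as the spread between the largest and smallest value; none of these details come from the abstract.

```python
import math

def equal_neighbor_step(x, neighbors):
    """Each node replaces its value by the plain average over itself and its neighbors."""
    return [
        sum(x[j] for j in [i] + neighbors[i]) / (1 + len(neighbors[i]))
        for i in range(len(x))
    ]

# Hypothetical undirected path graph 0 - 1 - 2 - 3 (connected at every step, so B = 1).
neighbors = [[1], [0, 2], [1, 3], [2]]
x = [0.0, 1.0, 5.0, 10.0]

n, B, eps = len(x), 1, 1e-6
# Number of steps suggested by the bound quoted in the abstract.
T = math.ceil(B + 4 * n**3 * B * math.log(2 * n / eps))

for _ in range(T):
    x = equal_neighbor_step(x, neighbors)

spread = max(x) - min(x)
print(T, spread, x)
```

On a fixed graph this rule converges to a degree-weighted average of the initial values rather than the plain average, because nodes with more neighbors carry more influence.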

OC Nov 8, 2012
Nonuniform Coverage Control on the Line

Naomi Ehrich Leonard, Alex Olshevsky

This paper investigates control laws allowing mobile, autonomous agents to optimally position themselves on the line for distributed sensing in a nonuniform field. We show that a simple static control law, based only on local measurements of the field by each agent, drives the agents close to the optimal positions after the agents execute in parallel a number of sensing/movement/computation rounds that is essentially quadratic in the number of agents. Further, we exhibit a dynamic control law which, under slightly stronger assumptions on the capabilities and knowledge of each agent, drives the agents close to the optimal positions after the agents execute in parallel a number of sensing/communication/computation/movement rounds that is essentially linear in the number of agents. Crucially, both algorithms are fully distributed and robust to unpredictable loss and addition of agents.

SY Jun 3, 2020
Deterministic and Randomized Actuator Scheduling With Guaranteed Performance Bounds

Milad Siami, Alex Olshevsky, Ali Jadbabaie

In this paper, we investigate the problem of actuator selection for linear dynamical systems. We develop a framework to design a sparse actuator schedule for a given large-scale linear system with guaranteed performance bounds using deterministic polynomial-time and randomized approximately linear-time algorithms. First, we introduce systemic controllability metrics for linear dynamical systems that are monotone and homogeneous with respect to the controllability Gramian. We show that several popular and widely used optimization criteria in the literature belong to this class of controllability metrics. Our main result is to provide a polynomial-time actuator schedule that on average selects only a constant number of actuators at each time step, independent of the dimension, to furnish a guaranteed approximation of the controllability metrics in comparison to when all actuators are in use. Our results naturally apply to the dual problem of sensor selection, in which we provide a guaranteed approximation to the observability Gramian. We illustrate the effectiveness of our theoretical findings via several numerical simulations using benchmark examples.

SY Feb 2, 2016
Convergence Time of Quantized Metropolis Consensus Over Time-Varying Networks

Tamer Basar, Seyed Rasoul Etesami, Alex Olshevsky

We consider the quantized consensus problem on undirected time-varying connected graphs with $n$ nodes, and devise a protocol with fast convergence time to the set of consensus points. Specifically, we show that when the edges of each network in a sequence of connected time-varying networks are activated based on Poisson processes with Metropolis rates, the expected convergence time to the set of consensus points is at most $O(n^2 \log^2 n)$, where each node performs a constant number of updates per unit time.

OC Mar 7, 2017
Scaling laws for consensus protocols subject to noise

Ali Jadbabaie, Alex Olshevsky

We study the performance of discrete-time consensus protocols in the presence of additive noise. When the consensus dynamic corresponds to a reversible Markov chain, we give an exact expression for a weighted version of steady-state disagreement in terms of the stationary distribution and hitting times in an underlying graph. We then show how this result can be used to characterize the noise robustness of a class of protocols for formation control in terms of the Kemeny constant of an underlying graph.

OC Jan 31, 2017
On (Non)Supermodularity of Average Control Energy

Alex Olshevsky

Given a linear system, we consider the expected energy to move from the origin to a uniformly random point on the unit sphere as a function of the set of actuated variables. We show this function is not necessarily supermodular, correcting some claims in the existing literature.

OC Mar 4, 2022
A Small Gain Analysis of Single Timescale Actor Critic

Alex Olshevsky, Bahman Gharesifard

We consider a version of actor-critic which uses proportional step-sizes and only one critic update with a single sample from the stationary distribution per actor step. We provide an analysis of this method using the small-gain theorem. Specifically, we prove that this method can be used to find a stationary point, and that the resulting sample complexity improves the state of the art for actor-critic methods to $O \left(μ^{-2} ε^{-2} \right)$ to find an $ε$-approximate stationary point where $μ$ is the condition number associated with the critic.

OC Apr 23, 2016
Eigenvalue Clustering, Control Energy, and Logarithmic Capacity

Alex Olshevsky

We prove two bounds showing that if the eigenvalues of a matrix are clustered in a region of the complex plane then the corresponding discrete-time linear system requires significant energy to control. A curious feature of one of our bounds is that the dependence on the region is via its logarithmic capacity, which is a measure of how well a unit of mass may be spread out over the region to minimize a logarithmic potential.

OC Feb 8, 2014
Consensus with Ternary Messages

Alex Olshevsky

We provide a protocol for real-valued average consensus by networks of agents which exchange only a single message from the ternary alphabet {-1,0,1} between neighbors at each step. Our protocol works on time-varying undirected graphs subject to a connectivity condition, has a worst-case convergence time which is polynomial in the number of agents and the initial values, and requires no global knowledge about the graph topologies on the part of each node to implement except for knowing an upper bound on the degrees of its neighbors.

LG Nov 29, 2022
Closing the gap between SVRG and TD-SVRG with Gradient Splitting

Arsenii Mustafin, Alex Olshevsky, Ioannis Ch. Paschalidis

Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance reduction methods. Recently, multiple works have sought to fuse TD learning with the Stochastic Variance Reduced Gradient (SVRG) method to achieve a geometric rate of convergence. However, the resulting convergence rate is significantly weaker than what is achieved by SVRG in the setting of convex optimization. In this work we utilize a recent interpretation of TD-learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. Our main result is a geometric convergence bound with predetermined learning rate of $1/8$, which is identical to the convergence bound available for SVRG in the convex setting. Our theoretical findings are supported by a set of experiments.

OC Apr 14
Network Epidemic Control via Model Predictive Control

Mahtab Talaei, Alex Olshevsky, Laura F. White et al.

Non-pharmaceutical interventions are critical for epidemic suppression but impose substantial societal costs, motivating feedback control policies that adapt to time-varying transmission. We formulate an infinite-horizon optimal control problem for a mobility-coupled networked SIQR epidemic model that minimizes isolation burden while enforcing epidemic suppression through a spectral decay condition. From this formulation, we derive a safety-critical Model Predictive Control (MPC) framework in which the spectral certificate is imposed as a hard stage-wise constraint, yielding a tunable exponential decay rate for infections. Exploiting the monotone depletion of susceptible populations, we construct a robust terminal set and safe backup policy. This structure ensures recursive feasibility and finite-horizon closed-loop exponential decay, and it certifies the existence of a globally stabilizing feasible continuation under bounded worst-case transmission rates. Numerical simulations on a 14-county Massachusetts network under a variant-induced surge show that, with administrative rate limits, reactive myopic control fails whereas MPC anticipates the shock and maintains exponential decay with lower isolation burden.

LG Jul 9, 2024
MDP Geometry, Normalization and Reward Balancing Solvers

Arsenii Mustafin, Aleksei Pakharev, Alex Olshevsky et al.

We present a new geometric interpretation of Markov Decision Processes (MDPs) with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This advantage-preserving transformation of the MDP motivates a class of algorithms which we call Reward Balancing, which solve MDPs by iterating through these transformations, until an approximately optimal policy can be trivially found. We provide a convergence analysis of several algorithms in this class, in particular showing that for MDPs with unknown transition probabilities we can improve upon state-of-the-art sample complexity results.

SY Jul 27, 2024
Network-Based Epidemic Control Through Optimal Travel and Quarantine Management

Mahtab Talaei, Apostolos I. Rikos, Alex Olshevsky et al.

Motivated by the swift global transmission of infectious diseases, we present a comprehensive framework for network-based epidemic control. Our aim is to curb epidemics using two different approaches. In the first approach, we introduce an optimization strategy that optimally reduces travel rates. We analyze the convergence of this strategy and show that it hinges on the network structure to minimize infection spread. In the second approach, we expand the classic SIR model by incorporating and optimizing quarantined states to strategically contain the epidemic. We show that this problem reduces to the problem of matrix balancing. We establish a link between optimization constraints and the epidemic's reproduction number, highlighting the relationship between network structure and disease dynamics. We demonstrate that applying augmented primal-dual gradient dynamics to the optimal quarantine problem ensures exponential convergence to the KKT point. We conclude by validating our approaches using simulation studies that leverage public data from counties in the state of Massachusetts.

LG May 3
Bridging the Gap Between Average and Discounted TD Learning

Haoxing Tian, Zaiwei Chen, Ioannis Ch. Paschalidis et al.

The analysis of Temporal Difference (TD) learning in the average-reward setting faces notable theoretical difficulties because the Bellman operator is not contractive with respect to any norm. This complicates standard analyses of stochastic updates that are effective in discounted settings. Although a considerable body of literature addresses these challenges, existing theoretical approaches come with limitations. We introduce a novel algorithm designed explicitly for policy evaluation in the average-reward setting, utilizing sampling from two Markovian trajectories. Our proposed method overcomes previous limitations by guaranteeing convergence to the unique solution of a properly defined projected Bellman equation. Notably, and in contrast to earlier work, our convergence analysis is uniformly applicable to both linear function approximation and tabular settings and does not involve explicit dimension-dependent terms in its convergence bounds. These results align with what is known to hold in the discounted setting. Furthermore, our algorithm achieves improved dependence on the problem's condition number, reducing the sample complexity from quartic, as in prior literature, to quadratic scaling, and thus matching the efficiency seen in the discounted setting.

LG Apr 30
Data Deletion Can Help in Adaptive RL

Param Budhraja, Aditya Gangrade, Alex Olshevsky et al.

Deploying reinforcement learning policies in the real world requires adapting to time-varying environments. We study this problem in the contextual Markov Decision Process (cMDP) framework, where a family of environments is indexed by a low-dimensional context unknown at test time. The standard approach decomposes the problem: train a so-called "universal policy" which assumes knowledge of the true context, then pair it with a context estimator which approximates context using the observed trajectory. We identify a simple, counterintuitive trick that substantially improves the estimator: randomly delete a fraction of the training buffer after each round. This works because data is collected across multiple rounds using progressively better policies, and older trajectories come from a different distribution than what the estimator will face at deployment time; random deletion creates an implicit exponential decay on older data while preserving diversity without requiring any explicit identification of which samples are stale. This reduces the robustness gap by 30% for MLPs and by 6% on average for recurrent networks. Strikingly, it allows a narrow MLP with 5x fewer parameters to outperform a wide MLP trained without deletion. To understand when and why deletion helps, we analyze regularized empirical risk minimization with a mismatch between the train distribution and the distribution at deployment; in this idealized setting, we prove that removing a single uniformly random training point decreases the test loss in expectation under mild conditions. For ridge regression we make this quantitative: deletion helps when the regularization coefficient is moderate and the signal-to-noise ratio (SNR) is sufficiently low, and, crucially, this SNR threshold gives a direct measure of how large the distribution mismatch between training and deployment must be for deletion to be beneficial.

LG Dec 8, 2023
On the Performance of Temporal Difference Learning With Neural Networks

Haoxing Tian, Ioannis Ch. Paschalidis, Alex Olshevsky

Neural Temporal Difference (TD) Learning is an approximate temporal difference method for policy evaluation that uses a neural network for function approximation. Analysis of Neural TD Learning has proven to be challenging. In this paper we provide a convergence analysis of Neural TD Learning with a projection onto $B(θ_0, ω)$, a ball of fixed radius $ω$ around the initial point $θ_0$. We show an approximation bound of $O(ε) + \tilde{O} (1/\sqrt{m})$ where $ε$ is the approximation quality of the best neural network in $B(θ_0, ω)$ and $m$ is the width of all hidden layers in the network.

SY Apr 16, 2024
Sample Complexity of the Linear Quadratic Regulator: A Reinforcement Learning Lens

Amirreza Neshaei Moghaddam, Alex Olshevsky, Bahman Gharesifard

We provide the first known algorithm that provably achieves $\varepsilon$-optimality within $\widetilde{\mathcal{O}}(1/\varepsilon)$ function evaluations for the discounted discrete-time LQR problem with unknown parameters, without relying on two-point gradient estimates. These estimates are known to be unrealistic in many settings, as they depend on using the exact same initialization, which is to be selected randomly, for two different policies. Our results substantially improve upon the existing literature outside the realm of two-point gradient estimates, which either leads to $\widetilde{\mathcal{O}}(1/\varepsilon^2)$ rates or heavily relies on stability assumptions.

LG Mar 13, 2024
One-Shot Averaging for Distributed TD($λ$) Under Markov Sampling

Haoxing Tian, Ioannis Ch. Paschalidis, Alex Olshevsky

We consider a distributed setup for reinforcement learning, where each agent has a copy of the same Markov Decision Process but transitions are sampled from the corresponding Markov chain independently by each agent. We show that in this setting, we can achieve a linear speedup for TD($λ$), a family of popular methods for policy evaluation, in the sense that $N$ agents can evaluate a policy $N$ times faster provided the target accuracy is small enough. Notably, this speedup is achieved by "one-shot averaging," a procedure where the agents run TD($λ$) with Markov sampling independently and only average their results after the final step. This significantly reduces the amount of communication required to achieve a linear speedup relative to previous work.

LG Mar 6, 2025
Geometric Re-Analysis of Classical MDP Solving Algorithms

Arsenii Mustafin, Aleksei Pakharev, Alex Olshevsky et al.

We build on a recently introduced geometric interpretation of Markov Decision Processes (MDPs) to analyze classical MDP-solving algorithms: Value Iteration (VI) and Policy Iteration (PI). First, we develop a geometry-based analytical apparatus, including a transformation that modifies the discount factor $γ$, to improve convergence guarantees for these algorithms in several settings. In particular, one of our results identifies a rotation component in the VI method, and as a consequence shows that when a Markov Reward Process (MRP) induced by the optimal policy is irreducible and aperiodic, the asymptotic convergence rate of value iteration is strictly smaller than $γ$.

OC Feb 20, 2025
Sample Complexity of Linear Quadratic Regulator Without Initial Stability

Amirreza Neshaei Moghaddam, Alex Olshevsky, Bahman Gharesifard

Inspired by REINFORCE, we introduce a novel receding-horizon algorithm for the Linear Quadratic Regulator (LQR) problem with unknown dynamics. Unlike prior methods, our algorithm avoids reliance on two-point gradient estimates while maintaining the same order of sample complexity. Furthermore, it eliminates the restrictive requirement of starting with a stable initial policy, broadening its applicability. Beyond these improvements, we introduce a refined analysis of error propagation through the contraction of the Riccati operator under the Riemannian distance. This refinement leads to a better sample complexity and ensures improved convergence guarantees.

LG Jan 8, 2024
Convex SGD: Generalization Without Early Stopping

Julien Hendrickx, Alex Olshevsky

We consider the generalization error associated with stochastic gradient descent on a smooth convex function over a compact set. We show the first bound on the generalization error that vanishes when the number of iterations $T$ and the dataset size $n$ go to infinity at arbitrary rates; our bound scales as $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ with step-size $α_t = 1/\sqrt{t}$. In particular, strong convexity is not needed for stochastic gradient descent to generalize well.

LG Feb 5, 2025
Analysis of Value Iteration Through Absolute Probability Sequences

Arsenii Mustafin, Sebastien Colla, Alex Olshevsky et al.

Value Iteration is a widely used algorithm for solving Markov Decision Processes (MDPs). While previous studies have extensively analyzed its convergence properties, they primarily focus on convergence with respect to the infinity norm. In this work, we use absolute probability sequences to develop a new line of analysis and examine the algorithm's convergence in terms of the $L^2$ norm, offering a new perspective on its behavior and performance.
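For orientation, a minimal Value Iteration sketch that tracks the error in both the infinity norm and an unweighted Euclidean norm. The two-state, two-action MDP is hypothetical, the 500-sweep iterate stands in for the fixed point, and the plain Euclidean norm is only a stand-in: the abstract does not say which weighting the absolute probability sequences induce.

```python
import math

def bellman_update(V, P, R, gamma):
    """One Value Iteration sweep:
    V(s) <- max_a [ R[s][a] + gamma * sum_s' P[s][a][s'] * V(s') ]."""
    return [
        max(
            R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(R[s]))
        )
        for s in range(len(V))
    ]

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def l2_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
gamma = 0.9

V = [0.0, 0.0]
history = [V]
for _ in range(500):
    V = bellman_update(V, P, R, gamma)
    history.append(V)

V_star = history[-1]  # numerical proxy for the fixed point
errors = [(sup_dist(Vk, V_star), l2_dist(Vk, V_star)) for Vk in history[:50]]
for k in (0, 10, 20, 30, 40):
    print(k, errors[k])
```

The infinity-norm column shrinks by at least a factor of `gamma` per sweep, which is the classical contraction guarantee; the second column is included only to show how a different norm can be monitored on the same iterates.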

MA Jun 14, 2024
Tree Search for Simultaneous Move Games via Equilibrium Approximation

Ryan Yu, Alex Olshevsky, Peter Chin

Neural network supported tree-search has shown strong results in a variety of perfect information multi-agent tasks. However, the performance of these methods on partial information games has generally been below competing approaches. Here we study the class of simultaneous-move games, the subclass of partial information games most similar to perfect information games: both agents know the game state with the exception of the opponent's move, which is revealed only after each agent makes its own move. Simultaneous move games include popular benchmarks such as Google Research Football and Starcraft. In this study we answer the question: can we take tree search algorithms trained through self-play from perfect information settings and adapt them to simultaneous move games without significant loss of performance? We answer this question by deriving a practical method that attempts to approximate a coarse correlated equilibrium as a subroutine within a tree search. Our algorithm works on cooperative, competitive, and mixed tasks. Our results are better than the current best MARL algorithms on a wide range of accepted baseline environments.

LG Jun 13, 2024
On Value Iteration Convergence in Connected MDPs

Arsenii Mustafin, Alex Olshevsky, Ioannis Ch. Paschalidis

This paper establishes that an MDP with a unique optimal policy and ergodic associated transition matrix ensures the convergence of various versions of the Value Iteration algorithm at a geometric rate that exceeds the discount factor γ for both discounted and average-reward criteria.

LG May 25, 2023
Distributed TD(0) with Almost No Communication

Rui Liu, Alex Olshevsky

We provide a new non-asymptotic analysis of distributed temporal difference learning with linear function approximation. Our approach relies on "one-shot averaging," where $N$ agents run identical local copies of the TD(0) method and average the outcomes only once at the very end. We demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). This is the first result proving benefits from parallelism for temporal difference methods.

DC Jun 9, 2021
Communication-efficient SGD: From Local SGD to One-Shot Averaging

Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $N$ workers, who can take SGD steps and coordinate with a central server. While it is possible to obtain a linear reduction in the variance by averaging all the stochastic gradients at every step, this requires a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $Ω( \sqrt{T} )$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(NT)$, this has been successively improved in a string of papers, with the state of the art requiring $Ω\left( N \left( \mbox{ poly} (\log T) \right) \right)$ communications. In this paper, we suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows. Our analysis shows that this can achieve an error that scales as $1/(NT)$ with a number of communications that is completely independent of $T$. In particular, we show that $Ω(N)$ communications are sufficient. Empirical evidence suggests this bound is close to tight as we further show that $\sqrt{N}$ or $N^{3/4}$ communications fail to achieve linear speed-up in simulations. Moreover, we show that under mild assumptions, the main of which is twice differentiability on any neighborhood of the optimal solution, one-shot averaging which only uses a single round of communication can also achieve the optimal convergence rate asymptotically.

LG Apr 16, 2021
Distributed TD(0) with Almost No Communication

Rui Liu, Alex Olshevsky

We provide a new non-asymptotic analysis of distributed TD(0) with linear function approximation. Our approach relies on "one-shot averaging," where $N$ agents run local copies of TD(0) and average the outcomes only once at the very end. We consider two models: one in which the agents interact with an environment they can observe and whose transitions depend on all of their actions (which we call the global state model), and one in which each agent can run a local copy of an identical Markov Decision Process, which we call the local state model. In the global state model, we show that the convergence rate of our distributed one-shot averaging method matches the known convergence rate of TD(0). By contrast, the best convergence rate in the previous literature showed a rate which, according to the worst-case bounds given, could underperform the non-distributed version by $O(N^3)$ in terms of the number of agents $N$. In the local state model, we demonstrate a version of the linear time speedup phenomenon, where the convergence time of the distributed process is a factor of $N$ faster than the convergence time of TD(0). As far as we are aware, this is the first result rigorously showing benefits from parallelism for temporal difference methods.
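A toy sketch of one-shot averaging in the local state model: $N$ agents each run tabular TD(0) on their own copy of the same chain and communicate only once, to average their final estimates. The two-state reward process, the constant step size, the seeds, and all names are assumptions made for illustration; tabular TD(0) is used here as the special case of linear function approximation with one-hot features.

```python
import random

def td0(seed, steps, alpha, gamma, P, r):
    """Tabular TD(0) along a single trajectory of the chain with transition matrix P."""
    rng = random.Random(seed)          # independent randomness per agent
    states = list(range(len(r)))
    V = [0.0] * len(r)
    s = 0
    for _ in range(steps):
        s_next = rng.choices(states, weights=P[s])[0]
        # TD(0) update on the visited state only
        V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
        s = s_next
    return V

# Hypothetical Markov reward process under a fixed policy.
P = [[0.5, 0.5], [0.5, 0.5]]
r = [1.0, 0.0]
gamma = 0.5
# Solving V = r + gamma * P V by hand gives V = [1.5, 0.5] for this chain.
V_true = [1.5, 0.5]

N = 8
local = [td0(seed, 5000, 0.05, gamma, P, r) for seed in range(N)]

# The single communication round: average the N final estimates.
V_avg = [sum(V[s] for V in local) / N for s in range(len(r))]
print(V_avg)
```

Averaging independent estimates reduces the variance of the result, which is the intuition behind the factor-of-$N$ speedup; the sketch makes no attempt to reproduce the rates proved in the paper.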

LG Oct 27, 2020
Temporal Difference Learning as Gradient Splitting

Rui Liu, Alex Olshevsky

Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We give a new interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a new, fuller explanation of why temporal difference works, our interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence time bounds for temporal difference learning had the multiplicative factor $1/(1-γ)$ in front of the bound, with $γ$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-γ)$ only multiplies an asymptotically negligible term.

LG Oct 23, 2020
Adversarial Crowdsourcing Through Robust Rank-One Matrix Completion

Qianqian Ma, Alex Olshevsky

We consider the problem of reconstructing a rank-one matrix from a revealed subset of its entries when some of the revealed entries are corrupted with perturbations that are unknown and can be arbitrarily large. It is not known which revealed entries are corrupted. We propose a new algorithm combining alternating minimization with extreme-value filtering and provide sufficient and necessary conditions to recover the original rank-one matrix. In particular, we show that our proposed algorithm is optimal when the set of revealed entries is given by an Erdős-Rényi random graph. These results are then applied to the problem of classification from crowdsourced data under the assumption that while the majority of the workers are governed by the standard single-coin Dawid-Skene model (i.e., they output the correct answer with a certain probability), some of the workers can deviate arbitrarily from this model. In particular, the "adversarial" workers could even make decisions designed to make the algorithm output an incorrect answer. Extensive experimental results show our algorithm for this problem, based on rank-one matrix completion with perturbations, outperforms all other state-of-the-art methods in such an adversarial scenario.

LG Aug 11, 2020
Asymptotic Convergence Rate of Alternating Minimization for Rank One Matrix Completion

Rui Liu, Alex Olshevsky

We study alternating minimization for matrix completion in the simplest possible setting: completing a rank-one matrix from a revealed subset of the entries. We bound the asymptotic convergence rate by the variational characterization of eigenvalues of a reversible consensus problem. This leads to a polynomial upper bound on the asymptotic rate in terms of number of nodes as well as the largest degree of the graph of revealed entries.
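A minimal sketch of alternating minimization for noiseless rank-one completion, assuming a positive ground-truth matrix $u v^\top$, an all-ones initialization, and a revealed pattern whose bipartite graph is connected. The instance and all names are hypothetical; each half-step is the closed-form least-squares update over the revealed entries only.

```python
def alternating_minimization(entries, n_rows, n_cols, iters):
    """Fit x[i] * y[j] to the revealed entries by alternating least squares.
    `entries` maps (i, j) -> observed value."""
    x = [1.0] * n_rows
    y = [1.0] * n_cols
    for _ in range(iters):
        # Update each row factor with the column factors held fixed.
        for i in range(n_rows):
            num = sum(m * y[j] for (r, j), m in entries.items() if r == i)
            den = sum(y[j] ** 2 for (r, j) in entries if r == i)
            x[i] = num / den
        # Update each column factor with the row factors held fixed.
        for j in range(n_cols):
            num = sum(m * x[i] for (i, c), m in entries.items() if c == j)
            den = sum(x[i] ** 2 for (i, c) in entries if c == j)
            y[j] = num / den
    return x, y

# Hypothetical ground truth and revealed pattern (a 6-cycle in the bipartite graph).
u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]
revealed = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 0)]
entries = {(i, j): u[i] * v[j] for i, j in revealed}

x, y = alternating_minimization(entries, 3, 3, 200)
completed = [[x[i] * y[j] for j in range(3)] for i in range(3)]
max_err = max(abs(completed[i][j] - u[i] * v[j]) for i in range(3) for j in range(3))
print(max_err)
```

With positive data, each half-step makes the ratio of an estimated factor to its true value a weighted average of the neighboring ratios on the graph of revealed entries, so the iteration behaves like a consensus process. That view is the link to consensus eigenvalues mentioned in the abstract, stated here informally.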

OC Jun 3, 2020
Local SGD With a Communication Overhead Depending Only on the Number of Workers

Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $n$ workers, who can take SGD steps and coordinate with a central server. Unfortunately, this could require a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $Ω( \sqrt{T} )$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(nT)$, this has been successively improved in a string of papers, with the state-of-the-art requiring $Ω\left( n \left( \mbox{ polynomial in log } (T) \right) \right)$ communications. In this paper, we give a new analysis of Local SGD. A consequence of our analysis is that Local SGD can achieve an error that scales as $1/(nT)$ with only a fixed number of communications independent of $T$: specifically, only $Ω(n)$ communications are required.

OC Jun 28, 2019
Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning

Shi Pu, Alex Olshevsky, Ioannis Ch. Paschalidis

We provide a discussion of several recent results which, in certain scenarios, are able to overcome a barrier in distributed stochastic optimization for machine learning. Our focus is the so-called asymptotic network independence property, which is achieved whenever a distributed method executed over a network of $n$ nodes asymptotically converges to the optimal solution at a comparable rate to a centralized method with the same computational power as the entire network. We explain this property through an example involving the training of ML models and sketch a short mathematical analysis for comparing the performance of distributed stochastic gradient descent (DSGD) with centralized stochastic gradient descent (SGD).

OCJun 6, 2019
A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent

Shi Pu, Alex Olshevsky, Ioannis Ch. Paschalidis

This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, DSGD asymptotically achieves the optimal network independent convergence rate compared to centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-ρ_w)^2}\right)$, where $1-ρ_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $Ω\left(\frac{n}{(1-ρ_w)^2} \right)$, implying the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.
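
A minimal sketch of one common form of DSGD may help make the setting concrete. Everything specific here is an assumption for illustration rather than the paper's construction: the ring network with uniform mixing weights of 1/3, the quadratic costs $f_i(x) = \frac{1}{2}(x-b_i)^2$, the step size $1/(k+10)$, and the name `dsgd_ring`.

```python
import random


def dsgd_ring(b, steps=2000, noise=0.1, seed=0):
    """Toy DSGD on a ring of len(b) >= 3 agents.

    Agent i holds f_i(x) = 0.5 * (x - b[i])^2 (assumed cost), so the
    minimizer of the average cost is the mean of b.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for k in range(steps):
        alpha = 1.0 / (k + 10)  # assumed diminishing step size
        # Local step with a noisy gradient: (x_i - b_i) + Gaussian noise.
        y = [x[i] - alpha * ((x[i] - b[i]) + rng.gauss(0.0, noise))
             for i in range(n)]
        # Mixing step: average with the two ring neighbors (weights 1/3 each).
        x = [(y[i - 1] + y[i] + y[(i + 1) % n]) / 3.0 for i in range(n)]
    return x
```

Calling `dsgd_ring([1.0, 2.0, 3.0, 4.0, 5.0])` returns one estimate per agent; in this toy setup all of them end up close to the mean of `b`, even though each agent only ever sees its own `b[i]` and its neighbors' iterates.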

HCApr 25, 2019
Gradient Descent for Sparse Rank-One Matrix Completion for Crowd-Sourced Aggregation of Sparsely Interacting Workers

Yao Ma, Alex Olshevsky, Venkatesh Saligrama et al.

We consider worker skill estimation for the single-coin Dawid-Skene crowdsourcing model. In practice, skill estimation is challenging because worker assignments are sparse and irregular due to the arbitrary and uncontrolled availability of workers. We formulate skill estimation as a rank-one correlation-matrix completion problem, where the observed components correspond to observed label correlations between workers. We show that the correlation matrix can be successfully recovered and skills are identifiable if and only if the sampling matrix (observed components) does not have a bipartite connected component. We then propose a projected gradient descent scheme and show that skill estimates converge to the desired global optima for such sampling matrices. Our proof is original and the results are surprising in light of the fact that even the weighted rank-one matrix factorization problem is NP-hard in general. Next, we derive sample complexity bounds in terms of spectral properties of the signless Laplacian of the sampling matrix. Our proposed scheme achieves state-of-the-art performance on a number of real-world datasets.

LGFeb 1, 2019
Graph Resistance and Learning from Pairwise Comparisons

Julien M. Hendrickx, Alex Olshevsky, Venkatesh Saligrama

We consider the problem of learning the qualities of a collection of items by performing noisy comparisons among them. Following the standard paradigm, we assume there is a fixed "comparison graph" and every neighboring pair of items in this graph is compared $k$ times according to the Bradley-Terry-Luce model (where the probability that an item wins a comparison is proportional to the item quality). We are interested in how the relative error in quality estimation scales with the comparison graph in the regime where $k$ is large. We prove that, after a known transition period, the relevant graph-theoretic quantity is the square root of the resistance of the comparison graph. Specifically, we provide an algorithm that is minimax optimal. The algorithm has a relative error decay that scales with the square root of the graph resistance, and we provide a matching lower bound (up to log factors). The performance guarantee of our algorithm, both in terms of the graph and the skewness of the item quality distribution, outperforms earlier results.
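
The Bradley-Terry-Luce comparison model mentioned in this abstract is easy to simulate. The sketch below only illustrates the observation model for a single pair of items, not the paper's estimation algorithm; the function names and the example qualities are hypothetical.

```python
import random


def btl_win_probability(w_i, w_j):
    """Probability that item i beats item j under the BTL model."""
    return w_i / (w_i + w_j)


def simulate_comparisons(w_i, w_j, k, seed=0):
    """Compare items i and j independently k times; return the wins of item i."""
    rng = random.Random(seed)
    p = btl_win_probability(w_i, w_j)
    return sum(1 for _ in range(k) if rng.random() < p)
```

With qualities 1.0 and 3.0 the first item wins each comparison with probability 0.25, so for large `k` the empirical win fraction `simulate_comparisons(1.0, 3.0, k) / k` concentrates around that value.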

OCApr 11, 2019
On the Inapproximability of the Discrete Witsenhausen Problem

Alex Olshevsky

We consider a discrete version of the Witsenhausen problem where all random variables are bounded and take on integer values. Our main goal is to understand the complexity of computing good strategies given the distributions for the initial state and second-stage noise as inputs to the problem. Following Papadimitriou and Tsitsiklis [1], who showed that computing the optimal solution is NP-complete, we construct a sequence of problem instances with the initial state uniform over a set of size $n$ and the noise uniform over a set of size at most $n^2$, such that finding a strategy whose cost is a multiplicative $n^{2-ε}$ approximation to the optimal cost is NP-hard for any $ε> 0$.

LGJun 20, 2017
Crowdsourcing with Sparsely Interacting Workers

Yao Ma, Alex Olshevsky, Venkatesh Saligrama et al.

We consider estimation of worker skills from worker-task interaction data (with unknown labels) for the single-coin crowd-sourcing binary classification model in symmetric noise. We define the (worker) interaction graph whose nodes are workers and an edge between two nodes indicates whether or not the two workers participated in a common task. We show that skills are asymptotically identifiable if and only if an appropriate limiting version of the interaction graph is irreducible and has odd cycles. We then formulate a weighted rank-one optimization problem to estimate skills based on observations on an irreducible, aperiodic interaction graph. We propose a gradient descent scheme and show that for such interaction graphs estimates converge asymptotically to the global minimum. We characterize noise robustness of the gradient scheme in terms of spectral properties of signless Laplacians of the interaction graph. We then demonstrate that a plug-in estimator based on the estimated skills achieves state-of-the-art performance on a number of real-world datasets. Our results have implications for the rank-one matrix completion problem in that gradient descent can provably recover $W \times W$ rank-one matrices based on $W+1$ off-diagonal observations of a connected graph with a single odd cycle.

OCAug 4, 2017
Linear Time Average Consensus on Fixed Graphs and Implications for Decentralized Optimization and Multi-Agent Control

Alex Olshevsky

We describe a protocol for the average consensus problem on any fixed undirected graph whose convergence time scales linearly in the total number of nodes $n$. The protocol is completely distributed, with the exception of requiring all nodes to know the same upper bound $U$ on the total number of nodes which is correct within a constant multiplicative factor. We next discuss applications of this protocol to problems in multi-agent control connected to the consensus problem. In particular, we describe protocols for formation maintenance and leader-following with convergence times which also scale linearly with the number of nodes. Finally, we develop a distributed protocol for minimizing an average of (possibly nondifferentiable) convex functions $(1/n) \sum_{i=1}^n f_i(θ)$, in the setting where only node $i$ in an undirected, connected graph knows the function $f_i(θ)$. Under the same assumption about all nodes knowing $U$, and additionally assuming that the subgradients of each $f_i(θ)$ have absolute values upper bounded by some constant $L$ known to the nodes, we show that after $T$ iterations our protocol has error which is $O(L \sqrt{n/T})$.
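
For readers unfamiliar with the average consensus problem, the following is a sketch of a plain (unaccelerated) averaging iteration with lazy Metropolis weights. It is a baseline illustration only and is not the linear-time protocol of this paper; the graph encoding as a dictionary of neighbor lists and the function names are assumptions.

```python
def lazy_metropolis_step(x, neighbors):
    """One synchronous round on an undirected graph.

    Each node i moves toward each neighbor j with weight
    0.5 / max(deg(i), deg(j)), which keeps the sum of all values fixed.
    """
    deg = {i: len(neighbors[i]) for i in neighbors}
    return {
        i: x[i] + 0.5 * sum((x[j] - x[i]) / max(deg[i], deg[j])
                            for j in neighbors[i])
        for i in neighbors
    }


def average_consensus(x0, neighbors, rounds):
    """Run `rounds` lazy Metropolis rounds starting from the values in x0."""
    x = dict(x0)
    for _ in range(rounds):
        x = lazy_metropolis_step(x, neighbors)
    return x
```

On the path graph `{0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}` with initial values `{0: 0.0, 1: 4.0, 2: 8.0, 3: 12.0}`, every node's value approaches the average 6.0 as the number of rounds grows, using only exchanges between neighbors.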

OCApr 10, 2017
Distributed Learning for Cooperative Inference

Angelia Nedić, Alex Olshevsky, César A. Uribe

We study the problem of cooperative inference where a group of agents interact over a network and seek to estimate a joint parameter that best explains a set of observations. Agents do not know the network topology or the observations of other agents. We explore a variational interpretation of the Bayesian posterior density, and its relation to the stochastic mirror descent algorithm, to propose a new distributed learning algorithm. We show that, under appropriate assumptions, the beliefs generated by the proposed algorithm concentrate around the true parameter exponentially fast. We provide explicit non-asymptotic bounds for the convergence rate. Moreover, we develop explicit and computationally efficient algorithms for observation models belonging to exponential families.

OCDec 6, 2016
Distributed Gaussian Learning over Time-varying Directed Graphs

Angelia Nedić, Alex Olshevsky, César A. Uribe

We present a distributed (non-Bayesian) learning algorithm for the problem of parameter estimation with Gaussian noise. The algorithm is expressed as explicit updates on the parameters of the Gaussian beliefs (i.e. means and precision). We show a convergence rate of $O(1/k)$ with the constant term depending on the number of agents and the topology of the network. Moreover, we show almost sure convergence to the optimal solution of the estimation problem for the general case of time-varying directed graphs.

OCSep 23, 2016
A Tutorial on Distributed (Non-Bayesian) Learning: Problem, Algorithms and Results

Angelia Nedić, Alex Olshevsky, César A. Uribe

We overview some results on distributed learning with a focus on a family of recently proposed algorithms known as non-Bayesian social learning. We consider different approaches to the distributed learning problem and its algorithmic solutions for the case of finitely many hypotheses. The original centralized problem is discussed first, followed by a generalization to the distributed setting. The results on convergence and convergence rate are presented for both asymptotic and finite time regimes. Various extensions are discussed, such as those dealing with directed time-varying networks, Nesterov's acceleration technique, and a continuum set of hypotheses.

OCSep 19, 2016
Geometrically Convergent Distributed Optimization with Uncoordinated Step-Sizes

Angelia Nedić, Alex Olshevsky, Wei Shi et al.

A recent algorithmic family for distributed optimization, DIGing, has been shown to have geometric convergence over time-varying undirected/directed graphs. Nevertheless, an identical step-size for all agents is needed. In this paper, we study the convergence rates of the Adapt-Then-Combine (ATC) variation of the DIGing algorithm under uncoordinated step-sizes. We show that the ATC variation of the DIGing algorithm converges geometrically fast even if the step-sizes are different among the agents. In addition, our analysis implies that the ATC structure can accelerate convergence compared to the distributed gradient descent (DGD) structure which has been used in the original DIGing algorithm.

DSAug 10, 2016
On symmetric continuum opinion dynamics

Julien M. Hendrickx, Alex Olshevsky

This paper investigates the asymptotic behavior of some common opinion dynamic models in a continuum of agents. We show that as long as the interactions among the agents are symmetric, the distribution of the agents' opinion converges. We also investigate whether convergence occurs in a stronger sense than merely in distribution, namely, whether the opinion of almost every agent converges. We show that while this is not the case in general, it becomes true under plausible assumptions on inter-agent interactions, namely that agents with similar opinions exert a non-negligible pull on each other, or that the interactions are entirely determined by their opinions via a smooth function.

OCMay 6, 2016
Distributed Learning with Infinitely Many Hypotheses

Angelia Nedić, Alex Olshevsky, César Uribe

We consider a distributed learning setup where a network of agents sequentially access realizations of a set of random variables with unknown distributions. The network objective is to find a parametrized distribution that best describes their joint observations in the sense of the Kullback-Leibler divergence. Departing from recent efforts in the literature, we analyze the case of countably many hypotheses and the case of a continuum of hypotheses. We provide non-asymptotic bounds for the concentration rate of the agents' beliefs around the correct hypothesis in terms of the number of agents, the network parameters, and the learning abilities of the agents. Additionally, we provide a novel motivation for a general set of distributed non-Bayesian update rules as instances of the distributed stochastic mirror descent algorithm.

OCSep 11, 2012
Cooperative learning in multi-agent systems from intermittent measurements

Naomi Ehrich Leonard, Alex Olshevsky

Motivated by the problem of tracking a direction in a decentralized way, we consider the general problem of cooperative learning in multi-agent systems with time-varying connectivity and intermittent measurements. We propose a distributed learning protocol capable of learning an unknown vector $μ$ from noisy measurements made independently by autonomous nodes. Our protocol is completely distributed and able to cope with the time-varying, unpredictable, and noisy nature of inter-agent communication, and intermittent noisy measurements of $μ$. Our main result bounds the learning speed of our protocol in terms of the size and combinatorial features of the (time-varying) networks connecting the nodes.

OCJan 14, 2009
Convergence Speed in Distributed Consensus and Control

Alex Olshevsky, John N. Tsitsiklis

We study the convergence speed of distributed iterative algorithms for the consensus and averaging problems, with emphasis on the latter. We first consider the case of a fixed communication topology. We show that a simple adaptation of a consensus algorithm leads to an averaging algorithm. We prove lower bounds on the worst-case convergence time for various classes of linear, time-invariant, distributed consensus methods, and provide an algorithm that essentially matches those lower bounds. We then consider the case of a time-varying topology, and provide a polynomial-time averaging algorithm.