Ali Jadbabaie

h-index67

96papers

5,723citations

Novelty56%

AI Score55

Ranked #23,374 of 205,806 authors (top 11%)#5,315 in LG (top 13%)

96 Papers

OCMay 14, 2020

Sensor Placement for Optimal Kalman Filtering: Fundamental Limits, Submodularity, and Algorithms

Vasileios Tzoumas, Ali Jadbabaie, George J. Pappas

In this paper, we focus on sensor placement in linear dynamic estimation, where the objective is to place a small number of sensors in a system of interdependent states so to design an estimator with a desired estimation performance. In particular, we consider a linear time-variant system that is corrupted with process and measurement noise, and study how the selection of its sensors affects the estimation error of the corresponding Kalman filter over a finite observation interval. Our contributions are threefold: First, we prove that the minimum mean square error of the Kalman filter decreases only linearly as the number of sensors increases. That is, adding extra sensors so to reduce this estimation error is ineffective, a fundamental design limit. Similarly, we prove that the number of sensors grows linearly with the system's size for fixed minimum mean square error and number of output measurements over an observation interval; this is another fundamental limit, especially for systems where the system's size is large. Second, we prove that the logdet of the error covariance of the Kalman filter, which captures the volume of the corresponding confidence ellipsoid, with respect to the system's initial condition and process noise is a supermodular and non-increasing set function in the choice of the sensor set. Therefore, it exhibits the diminishing returns property. Third, we provide efficient approximation algorithms that select a small number sensors so to optimize the Kalman filter with respect to this estimation error ---the worst-case performance guarantees of these algorithms are provided as well. Finally, we illustrate the efficiency of our algorithms using the problem of surface-based monitoring of CO2 sequestration sites studied in Weimer et al. (2008).

OCApr 22, 2011

A Distributed Newton Method for Network Utility Maximization

Ermin Wei, Asuman Ozdaglar, Ali Jadbabaie

Most existing work uses dual decomposition and subgradient methods to solve Network Utility Maximization (NUM) problems in a distributed manner, which suffer from slow rate of convergence properties. This work develops an alternative distributed Newton-type fast converging algorithm for solving network utility maximization problems with self-concordant utility functions. By using novel matrix splitting techniques, both primal and dual updates for the Newton step can be computed using iterative schemes in a decentralized manner with limited information exchange. Similarly, the stepsize can be obtained via an iterative consensus-based averaging scheme. We show that even when the Newton direction and the stepsize in our method are computed within some error (due to finite truncation of the iterative schemes), the resulting objective function value still converges superlinearly to an explicitly characterized error neighborhood. Simulation results demonstrate significant convergence rate improvement of our algorithm relative to the existing subgradient methods based on dual decomposition.

OCOct 31, 2017

Resilient Monotone Submodular Function Maximization

Vasileios Tzoumas, Konstantinos Gatsis, Ali Jadbabaie et al.

In this paper, we focus on applications in machine learning, optimization, and control that call for the resilient selection of a few elements, e.g. features, sensors, or leaders, against a number of adversarial denial-of-service attacks or failures. In general, such resilient optimization problems are hard, and cannot be solved exactly in polynomial time, even though they often involve objective functions that are monotone and submodular. Notwithstanding, in this paper we provide the first scalable, curvature-dependent algorithm for their approximate solution, that is valid for any number of attacks or failures, and which, for functions with low curvature, guarantees superior approximation performance. Notably, the curvature has been known to tighten approximations for several non-resilient maximization problems, yet its effect on resilient maximization had hitherto been unknown. We complement our theoretical analyses with supporting empirical evaluations.

OCApr 27, 2023

Convergence of Adam Under Relaxed Assumptions

Haochuan Li, Alexander Rakhlin, Ali Jadbabaie

In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $ε$-stationary points with ${O}(ε^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(ε^{-3})$.

SYMar 16, 2015

Minimal Actuator Placement with Optimal Control Constraints

Vasileios Tzoumas, Mohammad Amin Rahimian, George J. Pappas et al.

We introduce the problem of minimal actuator placement in a linear control system so that a bound on the minimum control effort for a given state transfer is satisfied while controllability is ensured. We first show that this is an NP-hard problem following the recent work of Olshevsky. Next, we prove that this problem has a supermodular structure. Afterwards, we provide an efficient algorithm that approximates up to a multiplicative factor of O(logn), where n is the size of the multi-agent network, any optimal actuator set that meets the specified energy criterion. Moreover, we show that this is the best approximation factor one can achieve in polynomial-time for the worst case. Finally, we test this algorithm over large Erdos-Renyi random networks to further demonstrate its efficiency.

LGOct 2, 2023

Linear attention is (maybe) all you need (to understand transformer optimization)

Kwangjun Ahn, Xiang Cheng, Minhak Song et al.

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J.~von Oswald et al.~(ICML 2023), and K.~Ahn et al.~(NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.

SYFeb 20, 2018

Minimal Reachability is Hard To Approximate

Ali Jadbabaie, Alexander Olshevsky, George J. Pappas et al.

In this note, we consider the problem of choosing which nodes of a linear dynamical system should be actuated so that the state transfer from the system's initial condition to a given final state is possible. Assuming a standard complexity hypothesis, we show that this problem cannot be efficiently solved or approximated in polynomial, or even quasi-polynomial, time.

OCJun 2, 2023

Convex and Non-convex Optimization Under Generalized Smoothness

Haochuan Li, Jian Qian, Yi Tian et al.

Classical analysis of convex and non-convex optimization methods often requires the Lipshitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.

LGDec 21, 2022

A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks

Xinyi Wu, Zhengdao Chen, William Wang et al.

Oversmoothing is a central challenge of building more powerful Graph Neural Networks (GNNs). While previous works have only demonstrated that oversmoothing is inevitable when the number of graph convolutions tends to infinity, in this paper, we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects when applying graph convolutions -- an undesirable mixing effect that homogenizes node representations in different classes, and a desirable denoising effect that homogenizes node representations in the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and the number of layers required for this transition is $O(\log N/\log (\log N))$ for sufficiently dense graphs with $N$ nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR), or equivalently, the effects of initial residual connections on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at a shallow depth and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice can be magnified by the difficulty of optimizing deep GNN models.

SYJun 3, 2020

Deterministic and Randomized Actuator Scheduling With Guaranteed Performance Bounds

Milad Siami, Alex Olshevsky, Ali Jadbabaie

In this paper, we investigate the problem of actuator selection for linear dynamical systems. We develop a framework to design a sparse actuator schedule for a given large-scale linear system with guaranteed performance bounds using deterministic polynomial-time and randomized approximately linear-time algorithms. First, we introduce systemic controllability metrics for linear dynamical systems that are monotone and homogeneous with respect to the controllability Gramian. We show that several popular and widely used optimization criteria in the literature belong to this class of controllability metrics. Our main result is to provide a polynomial-time actuator schedule that on average selects only a constant number of actuators at each time step, independent of the dimension, to furnish a guaranteed approximation of the controllability metrics in comparison to when all actuators are in use. Our results naturally apply to the dual problem of sensor selection, in which we provide a guaranteed approximation to the observability Gramian. We illustrate the effectiveness of our theoretical findings via several numerical simulations using benchmark examples.

SYOct 6, 2014

Controllability and Fraction of Leaders in Infinite Network

Chinwendu Enyioha, Mohammad Amin Rahimian, George J. Pappas et al.

In this paper, we study controllability of a network of linear single-integrator agents when the network size goes to infinity. We first investigate the effect of increasing size by injecting an input at every node and requiring that network controllability Gramian remain well-conditioned with the increasing dimension. We provide theoretical justification to the intuition that high degree nodes pose a challenge to network controllability. In particular, the controllability Gramian for the networks with bounded maximum degrees is shown to remain well-conditioned even as the network size goes to infinity. In the canonical cases of star, chain and ring networks, we also provide closed-form expressions which bound the condition number of the controllability Gramian in terms of the network size. We next consider the effect of the choice and number of leader nodes by actuating only a subset of nodes and considering the least eigenvalue of the Gramian as the network size increases. Accordingly, while a directed star topology can never be made controllable for all sizes by injecting an input just at a fraction $f<1$ of nodes; for path or cycle networks, the designer can actuate a non-zero fraction of nodes and spread them throughout the network in such way that the least eigenvalue of the Gramians remain bounded away from zero with the increasing size. The results offer interesting insights on the challenges of control in large networks and with high-degree nodes.

SISep 11, 2012

Moment-Based Spectral Analysis of Large-Scale Networks Using Local Structural Information

Victor M. Preciado, Ali Jadbabaie

The eigenvalues of matrices representing the structure of large-scale complex networks present a wide range of applications, from the analysis of dynamical processes taking place in the network to spectral techniques aiming to rank the importance of nodes in the network. A common approach to study the relationship between the structure of a network and its eigenvalues is to use synthetic random networks in which structural properties of interest, such as degree distributions, are prescribed. Although very common, synthetic models present two major flaws: (\emph{i}) These models are only suitable to study a very limited range of structural properties, and (\emph{ii}) they implicitly induce structural properties that are not directly controlled and can deceivingly influence the network eigenvalue spectrum. In this paper, we propose an alternative approach to overcome these limitations. Our approach is not based on synthetic models, instead, we use algebraic graph theory and convex optimization to study how structural properties influence the spectrum of eigenvalues of the network. Using our approach, we can compute with low computational overhead global spectral properties of a network from its local structural properties. We illustrate our approach by studying how structural properties of online social networks influence their eigenvalue spectra.

LGMar 2, 2023

Variance-reduced Clipping for Non-convex Optimization

Amirhossein Reisizadeh, Haochuan Li, Subhro Das et al.

Gradient clipping is a standard training technique used in deep learning applications such as large-scale language modeling to mitigate exploding gradients. Recent experimental studies have demonstrated a fairly special behavior in the smoothness of the training objective along its trajectory when trained with gradient clipping. That is, the smoothness grows with the gradient norm. This is in clear contrast to the well-established assumption in folklore non-convex optimization, a.k.a. $L$--smoothness, where the smoothness is assumed to be bounded by a constant $L$ globally. The recently introduced $(L_0,L_1)$--smoothness is a more relaxed notion that captures such behavior in non-convex optimization. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires $O(ε^{-4})$ stochastic gradient computations to find an $ε$--stationary solution. In this paper, we employ a variance reduction technique, namely SPIDER, and demonstrate that for a carefully designed learning rate, this complexity is improved to $O(ε^{-3})$ which is order-optimal. Our designed learning rate comprises the clipping technique to mitigate the growing smoothness. Moreover, when the objective function is the average of $n$ components, we improve the existing $O(nε^{-2})$ bound on the stochastic gradient complexity to $O(\sqrt{n} ε^{-2} + n)$, which is order-optimal as well. In addition to being theoretically optimal, SPIDER with our designed parameters demonstrates comparable empirical performance against variance-reduced methods such as SVRG and SARAH in several vision tasks.

GTMar 4, 2015

Competitive Diffusion in Social Networks: Quality or Seeding?

Arastoo Fazeli, Amir Ajorlou, Ali Jadbabaie

In this paper, we study a strategic model of marketing and product consumption in social networks. We consider two firms in a market competing to maximize the consumption of their products. Firms have a limited budget which can be either invested on the quality of the product or spent on initial seeding in the network in order to better facilitate spread of the product. After the decision of firms, agents choose their consumptions following a myopic best response dynamics which results in a local, linear update for their consumption decision. We characterize the unique Nash equilibrium of the game between firms and study the effect of the budgets as well as the network structure on the optimal allocation. We show that at the equilibrium, firms invest more budget on quality when their budgets are close to each other. However, as the gap between budgets widens, competition in qualities becomes less effective and firms spend more of their budget on seeding. We also show that given equal budget of firms, if seeding budget is nonzero for a balanced graph, it will also be nonzero for any other graph, and if seeding budget is zero for a star graph it will be zero for any other graph as well. As a practical extension, we then consider a case where products have some preset qualities that can be only improved marginally. At some point in time, firms learn about the network structure and decide to utilize a limited budget to mount their market share by either improving the quality or new seeding some agents to incline consumers towards their products. We show that the optimal budget allocation in this case simplifies to a threshold strategy. Interestingly, we derive similar results to that of the original problem, in which preset qualities simulate the role that budgets had in the original setup.

SYFeb 1, 2013

Bayesian Quadratic Network Game Filters

Ceyhun Eksin, Pooya Molavi, Alejandro Ribeiro et al.

A repeated network game where agents have quadratic utilities that depend on information externalities -- an unknown underlying state -- as well as payoff externalities -- the actions of all other agents in the network -- is considered. Agents play Bayesian Nash Equilibrium strategies with respect to their beliefs on the state of the world and the actions of all other nodes in the network. These beliefs are refined over subsequent stages based on the observed actions of neighboring peers. This paper introduces the Quadratic Network Game (QNG) filter that agents can run locally to update their beliefs, select corresponding optimal actions, and eventually learn a sufficient statistic of the network's state. The QNG filter is demonstrated on a Cournot market competition game and a coordination game to implement navigation of an autonomous team.

OCSep 7, 2012

Structural Analysis of Laplacian Spectral Properties of Large-Scale Networks

Victor M. Preciado, Ali Jadbabaie, George C. Verghese

Using methods from algebraic graph theory and convex optimization, we study the relationship between local structural features of a network and spectral properties of its Laplacian matrix. In particular, we derive expressions for the so-called spectral moments of the Laplacian matrix of a network in terms of a collection of local structural measurements. Furthermore, we propose a series of semidefinite programs to compute bounds on the spectral radius and the spectral gap of the Laplacian matrix from a truncated sequence of Laplacian spectral moments. Our analysis shows that the Laplacian spectral moments and spectral radius are strongly constrained by local structural features of the network. On the other hand, we illustrate how local structural features are usually not enough to estimate the Laplacian spectral gap.

SYJan 8, 2015

Distributed Resource Allocation for Epidemic control

Chinwendu Enyioha, Ali Jadbabaie, Victor Preciado et al.

We present a distributed resource allocation strategy to control an epidemic outbreak in a networked population based on a Distributed Alternating Direction Method of Multipliers (D-ADMM) algorithm. We consider a linearized Susceptible- Infected-Susceptible (SIS) epidemic spreading model in which agents in the network are able to allocate vaccination resources (for prevention) and antidotes (for treatment) in the presence of a contagion. We express our epidemic control condition as a spectral constraint involving the Perron-Frobenius eigenvalue, and formulate the resource allocation problem as a Geometric Program (GP). Next, we separate the network-wide optimization problem into subproblems optimally solved by each agent in a fully distributed way. We conclude the paper by illustrating performance of our solution framework with numerical simulations.

OCMar 7, 2017

Scaling laws for consensus protocols subject to noise

Ali Jadbabaie, Alex Olshevsky

We study the performance of discrete-time consensus protocols in the presence of additive noise. When the consensus dynamic corresponds to a reversible Markov chain, we give an exact expression for a weighted version of steady-state disagreement in terms of the stationary distribution and hitting times in an underlying graph. We then show how this result can be used to characterize the noise robustness of a class of protocols for formation control in terms of the Kemeny constant of an underlying graph.

LGJun 6, 2022

An Optimal Transport Approach to Personalized Federated Learning

Farzan Farnia, Amirhossein Reisizadeh, Ramtin Pedarsani et al.

Federated learning is a distributed machine learning paradigm, which aims to train a model using the local data of many distributed clients. A key challenge in federated learning is that the data samples across the clients may not be identically distributed. To address this challenge, personalized federated learning with the goal of tailoring the learned model to the data distribution of every individual client has been proposed. In this paper, we focus on this problem and propose a novel personalized Federated Learning scheme based on Optimal Transport (FedOT) as a learning algorithm that learns the optimal transport maps for transferring data points to a common distribution as well as the prediction model under the applied transport map. To formulate the FedOT problem, we extend the standard optimal transport task between two probability distributions to multi-marginal optimal transport problems with the goal of transporting samples from multiple distributions to a common probability domain. We then leverage the results on multi-marginal optimal transport problems to formulate FedOT as a min-max optimization problem and analyze its generalization and optimization properties. We discuss the results of several numerical experiments to evaluate the performance of FedOT under heterogeneous data distributions in federated learning problems.

STNov 7, 2016

Bayesian Learning without Recall

M. Amin Rahimian, Ali Jadbabaie

We analyze a model of learning and belief formation in networks in which agents follow Bayes rule yet they do not recall their history of past observations and cannot reason about how other agents' beliefs are formed. They do so by making rational inferences about their observations which include a sequence of independent and identically distributed private signals as well as the actions of their neighboring agents at each time. Successive applications of Bayes rule to the entire history of past observations lead to forebodingly complex inferences: due to lack of knowledge about the global network structure, and unavailability of private observations, as well as third party interactions preceding every decision. Such difficulties make Bayesian updating of beliefs an implausible mechanism for social learning. To address these complexities, we consider a Bayesian without Recall model of inference. On the one hand, this model provides a tractable framework for analyzing the behavior of rational agents in social networks. On the other hand, this model also provides a behavioral foundation for the variety of non-Bayesian update rules in the literature. We present the implications of various choices for the structure of the action space and utility functions for such agents and investigate the properties of learning, convergence, and consensus in special cases.

OCMar 17, 2017

Decentralized Observability with Limited Communication between Sensors

Andreea B. Alexandru, Sergio Pequito, Ali Jadbabaie et al.

In this paper, we study the problem of jointly retrieving the state of a dynamical system, as well as the state of the sensors deployed to estimate it. We assume that the sensors possess a simple computational unit that is capable of performing simple operations, such as retaining the current state and model of the system in its memory. We assume the system to be observable (given all the measurements of the sensors), and we ask whether each sub-collection of sensors can retrieve the state of the underlying physical system, as well as the state of the remaining sensors. To this end, we consider communication between neighboring sensors, whose adjacency is captured by a communication graph. We then propose a linear update strategy that encodes the sensor measurements as states in an augmented state space, with which we provide the solution to the problem of retrieving the system and sensor states. The present paper contains three main contributions. First, we provide necessary and sufficient conditions to ensure observability of the system and sensor states from any sensor. Second, we address the problem of adding communication between sensors when the necessary and sufficient conditions are not satisfied, and devise a strategy to this end. Third, we extend the former case to include different costs of communication between sensors. Finally, the concepts defined and the method proposed are used to assess the state of an example of approximate structural brain dynamics through linearized measurements.

LGApr 3, 2022

Byzantine-Robust Federated Linear Bandits

Ali Jadbabaie, Haochuan Li, Jian Qian et al.

In this paper, we study a linear bandit optimization problem in a federated setting where a large collection of distributed agents collaboratively learn a common linear bandit model. Standard federated learning algorithms applied to this setting are vulnerable to Byzantine attacks on even a small fraction of agents. We propose a novel algorithm with a robust aggregation oracle that utilizes the geometric median. We prove that our proposed algorithm is robust to Byzantine attacks on fewer than half of agents and achieves a sublinear $\tilde{\mathcal{O}}({T^{3/4}})$ regret with $\mathcal{O}(\sqrt{T})$ steps of communication in $T$ steps. Moreover, we make our algorithm differentially private via a tree-based mechanism. Finally, if the level of corruption is known to be small, we show that using the geometric median of mean oracle for robust aggregation further improves the regret bound.

SYJan 17, 2018

On the Limited Communication Analysis and Design for Decentralized Estimation

Andreea B. Alexandru, Sergio Pequito, Ali Jadbabaie et al.

This paper pertains to the analysis and design of decentralized estimation schemes that make use of limited communication. Briefly, these schemes equip the sensors with scalar states that iteratively merge the measurements and the state of other sensors to be used for state estimation. Contrarily to commonly used distributed estimation schemes, the only information being exchanged are scalars, there is only one common time-scale for communication and estimation, and the retrieval of the state of the system and sensors is achieved in finite-time. We extend previous work to a more general setup and provide necessary and sufficient conditions required for the communication between the sensors that enable the use of limited communication decentralized estimation~schemes. Additionally, we discuss the cases where the sensors are memoryless, and where the sensors might not have the capacity to discern the contributions of other sensors. Based on these conditions and the fact that communication channels incur a cost, we cast the problem of finding the minimum cost communication graph that enables limited communication decentralized estimation schemes as an integer programming problem.

OCJul 3, 2022

On Convergence of Gradient Descent Ascent: A Tight Local Analysis

Haochuan Li, Farzan Farnia, Subhro Das et al.

Gradient Descent Ascent (GDA) methods are the mainstream algorithms for minimax optimization in generative adversarial networks (GANs). Convergence properties of GDA have drawn significant interest in the recent literature. Specifically, for $\min_{\mathbf{x}} \max_{\mathbf{y}} f(\mathbf{x};\mathbf{y})$ where $f$ is strongly-concave in $\mathbf{y}$ and possibly nonconvex in $\mathbf{x}$, (Lin et al., 2020) proved the convergence of GDA with a stepsize ratio $η_{\mathbf{y}}/η_{\mathbf{x}}=Θ(κ^2)$ where $η_{\mathbf{x}}$ and $η_{\mathbf{y}}$ are the stepsizes for $\mathbf{x}$ and $\mathbf{y}$ and $κ$ is the condition number for $\mathbf{y}$. While this stepsize ratio suggests a slow training of the min player, practical GAN algorithms typically adopt similar stepsizes for both variables, indicating a wide gap between theoretical and empirical results. In this paper, we aim to bridge this gap by analyzing the \emph{local convergence} of general \emph{nonconvex-nonconcave} minimax problems. We demonstrate that a stepsize ratio of $Θ(κ)$ is necessary and sufficient for local convergence of GDA to a Stackelberg Equilibrium, where $κ$ is the local condition number for $\mathbf{y}$. We prove a nearly tight convergence rate with a matching lower bound. We further extend the convergence guarantees to stochastic GDA and extra-gradient methods (EG). Finally, we conduct several numerical experiments to support our theoretical findings.

MANov 2, 2016

Bayesian Heuristics for Group Decisions

M. Amin Rahimian, Ali Jadbabaie

We propose a model of inference and heuristic decision-making in groups that is rooted in the Bayes rule but avoids the complexities of rational inference in partially observed environments with incomplete information, which are characteristic of group interactions. Our model is also consistent with a dual-process psychological theory of thinking: the group members behave rationally at the initiation of their interactions with each other (the slow and deliberative mode); however, in the ensuing decision epochs, they rely on a heuristic that replicates their experiences from the first stage (the fast automatic mode). We specialize this model to a group decision scenario where private observations are received at the beginning, and agents aim to take the best action given the aggregate observations of all group members. We study the implications of the information structure together with the properties of the probability distributions which determine the structure of the so-called "Bayesian heuristics" that the agents follow in our model. We also analyze the group decision outcomes in two classes of linear action updates and log-linear belief updates and show that many inefficiencies arise in group decisions as a result of repeated interactions between individuals, leading to overconfident beliefs as well as choice-shifts toward extremes. Nevertheless, balanced regular structures demonstrate a measure of efficiency in terms of aggregating the initial information of individuals. These results not only verify some well-known insights about group decision-making but also complement these insights by revealing additional mechanistic interpretations for the group declension-process, as well as psychological and cognitive intuitions about the group interaction model.

OCOct 17, 2022

Model Predictive Control via On-Policy Imitation Learning

Kwangjun Ahn, Zakaria Mhammedi, Horia Mania et al.

In this paper, we leverage the rapid advances in imitation learning, a topic of intense recent focus in the Reinforcement Learning (RL) literature, to develop new sample complexity results and performance guarantees for data-driven Model Predictive Control (MPC) for constrained linear systems. In its simplest form, imitation learning is an approach that tries to learn an expert policy by querying samples from an expert. Recent approaches to data-driven MPC have used the simplest form of imitation learning known as behavior cloning to learn controllers that mimic the performance of MPC by online sampling of the trajectories of the closed-loop MPC system. Behavior cloning, however, is a method that is known to be data inefficient and suffer from distribution shifts. As an alternative, we develop a variant of the forward training algorithm which is an on-policy imitation learning method proposed by Ross et al. (2010). Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties of the explicit MPC solution to theoretically bound the number of online MPC trajectories needed to achieve optimal performance. We validate our results through simulations and show that the forward training algorithm is indeed superior to behavior cloning when applied to MPC.

SYDec 31, 2015

Finite-time Analysis of the Distributed Detection Problem

Shahin Shahrampour, Alexander Rakhlin, Ali Jadbabaie

This paper addresses the problem of distributed detection in fixed and switching networks. A network of agents observe partially informative signals about the unknown state of the world. Hence, they collaborate with each other to identify the true state. We propose an update rule building on distributed, stochastic optimization methods. Our main focus is on the finite-time analysis of the problem. For fixed networks, we bring forward the notion of Kullback-Leibler cost to measure the efficiency of the algorithm versus its centralized analog. We bound the cost in terms of the network size, spectral gap and relative entropy of agents' signal structures. We further consider the problem in random networks where the structure is realized according to a stationary distribution. We then prove that the convergence is exponentially fast (with high probability), and the non-asymptotic rate scales inversely in the spectral gap of the expected network.

MASep 30, 2010

Spectral Control of Mobile Robot Networks

Michael M. Zavlanos, Victor M. Preciado, Ali Jadbabaie

The eigenvalue spectrum of the adjacency matrix of a network is closely related to the behavior of many dynamical processes run over the network. In the field of robotics, this spectrum has important implications in many problems that require some form of distributed coordination within a team of robots. In this paper, we propose a continuous-time control scheme that modifies the structure of a position-dependent network of mobile robots so that it achieves a desired set of adjacency eigenvalues. For this, we employ a novel abstraction of the eigenvalue spectrum by means of the adjacency matrix spectral moments. Since the eigenvalue spectrum is uniquely determined by its spectral moments, this abstraction provides a way to indirectly control the eigenvalues of the network. Our construction is based on artificial potentials that capture the distance of the network's spectral moments to their desired values. Minimization of these potentials is via a gradient descent closed-loop system that, under certain convexity assumptions, ensures convergence of the network topology to one with the desired set of moments and, therefore, eigenvalues. We illustrate our approach in nontrivial computer simulations.

OCSep 3, 2012

Structural Analysis of Viral Spreading Processes in Social and Communication Networks Using Egonets

Victor M. Preciado, Moez Draief, Ali Jadbabaie

We study how the behavior of viral spreading processes is influenced by local structural properties of the network over which they propagate. For a wide variety of spreading processes, the largest eigenvalue of the adjacency matrix of the network plays a key role on their global dynamical behavior. For many real-world large-scale networks, it is unfeasible to exactly retrieve the complete network structure to compute its largest eigenvalue. Instead, one usually have access to myopic, egocentric views of the network structure, also called egonets. In this paper, we propose a mathematical framework, based on algebraic graph theory and convex optimization, to study how local structural properties of the network constrain the interval of possible values in which the largest eigenvalue must lie. Based on this framework, we present a computationally efficient approach to find this interval from a collection of egonets. Our numerical simulations show that, for several social and communication networks, local structural properties of the network strongly constrain the location of the largest eigenvalue and the resulting spreading dynamics. From a practical point of view, our results can be used to dictate immunization strategies to tame the spreading of a virus, or to design network topologies that facilitate the spreading of information virally.

SYJun 2, 2023

On the Sample Complexity of Imitation Learning for Smoothed Model Predictive Control

Daniel Pfrommer, Swati Padmanabhan, Kwangjun Ahn et al.

Recent work in imitation learning has shown that having an expert controller that is both suitably smooth and stable enables stronger guarantees on the performance of the learned controller. However, constructing such smoothed expert controllers for arbitrary systems remains challenging, especially in the presence of input and state constraints. As our primary contribution, we show how such a smoothed expert can be designed for a general class of systems using a log-barrier-based relaxation of a standard Model Predictive Control (MPC) optimization problem. At the crux of this theoretical guarantee on smoothness is a new lower bound we prove on the optimality gap of the analytic center associated with a convex Lipschitz function, which we hope could be of independent interest. We validate our theoretical findings via experiments, demonstrating the merits of our smoothing approach over randomized smoothing.

LGJun 16, 2022

Gradient Descent for Low-Rank Functions

Romain Cosson, Ali Jadbabaie, Anuran Makur et al.

Several recent empirical studies demonstrate that important machine learning tasks, e.g., training deep neural networks, exhibit low-rank structure, where the loss function varies significantly in only a few directions of the input space. In this paper, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an $ε$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/ε) + rp)$ and $\mathcal{O}(r/ε^2 + rp)$, respectively. When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/ε))$ and $\mathcal{O}(p/ε^2)$ of {\gd} in the strongly convex and non-convex settings, respectively. Thus, LRGD significantly reduces the computational cost of gradient-based methods for sufficiently low-rank functions. In the course of our analysis, we also formally define and characterize the classes of exact and approximately low-rank functions.

GTJun 12, 2011

Analysis of Equilibria and Strategic Interaction in Complex Networks

Victor M. Preciado, Jaelynn Oh, Ali Jadbabaie

This paper studies $n$-person simultaneous-move games with linear best response function, where individuals interact within a given network structure. This class of games have been used to model various settings, such as, public goods, belief formation, peer effects, and oligopoly. The purpose of this paper is to study the effect of the network structure on Nash equilibrium outcomes of this class of games. Bramoullé et al. derived conditions for uniqueness and stability of a Nash equilibrium in terms of the smallest eigenvalue of the adjacency matrix representing the network of interactions. Motivated by this result, we study how local structural properties of the network of interactions affect this eigenvalue, influencing game equilibria. In particular, we use algebraic graph theory and convex optimization to derive new bounds on the smallest eigenvalue in terms of the distribution of degrees, cycles, and other relevant substructures. We illustrate our results with numerical simulations involving online social networks.

SISep 28, 2016

Bio-Inspired Framework for Allocation of Protection Resources in Cyber-Physical Networks

Victor M. Preciado, Michael Zargham, Chinwendu Enyioha et al.

In this chapter, we consider the problem of designing protection strategies to contain spreading processes in complex cyber-physical networks. We illustrate our ideas using a family of bio-motivated spreading models originally proposed in the epidemiological literature, e.g., the Susceptible-Infected-Susceptible (SIS) model. We first introduce a framework in which we are allowed to distribute two types of resources in order to contain the spread, namely, (i) preventive resources able to reduce the spreading rate, and (ii) corrective resources able to increase the recovery rate of nodes in which the resources are allocated. In practice, these resources have an associated cost that depends on either the resiliency level achieved by the preventive resource, or the restoration efficiency of the corrective resource. We present a mathematical framework, based on dynamic systems theory and convex optimization, to find the cost-optimal distribution of protection resources in a network to contain the spread. We also present two extensions to this framework in which (i) we consider generalized epidemic models, beyond the simple SIS model, and (ii) we assume uncertainties in the contact network in which the spreading is taking place. We compare these protection strategies with common heuristics previously proposed in the literature and illustrate our results with numerical simulations using the air traffic network.

SYJun 17, 2018

Decoupled Dynamics Distributed Control for Strings of Nonlinear Autonomous Agents

Serban Sabau, Irinel-Constantin Morarescu, Lucian Busoniu et al.

We introduce a distributed control architecture for a class of heterogeneous, nonlinear dynamical agents moving in the "string" formation, while guaranteeing trajectory tracking, collision avoidance and the preservation of the formation's topology. Each autonomous agent uses information and relative measurements only with respect to its predecessor in the string. The performance of the scheme is independent of the number of agents in the network and also on the agent's relative position in the network. The scalability is a consequence of the "decoupling" of a certain bounded approximation of the closed--loop equations, which allows the regulation and controller design (at each agent) to be done individually, in a completely decentralized manner. A practical method for compensating communication induced delays is also presented. Numerical examples illustrate the effectiveness and the main features of the proposed approach.

89.2OCMar 16

Online Learning for Supervisory Switching Control

Haoyuan Sun, Ali Jadbabaie

We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy the best controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to address these control-theoretic challenges. Our data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of historical states, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the most suitable controller in $\mathcal{O}(N \log N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.

LGJul 27, 2023

Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior

Adam Block, Ali Jadbabaie, Daniel Pfrommer et al.

We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. Our framework invokes low-level controllers - either learned or implicit in position-command control - to stabilize imitation around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a powerful enough generative model as our imitation learner, pure supervised behavior cloning can generate trajectories matching the per-time step distribution of essentially arbitrary expert trajectories in an optimal transport cost. Our analysis relies on a stochastic continuity property of the learned policy we call "total variation continuity" (TVC). We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations, and discussing implications for future research directions for better behavior cloning with generative modeling.

95.8RMApr 2

Network and Risk Analysis of Surety Bonds

Tamara Broderick, Ali Jadbabaie, Vanessa Lin et al.

Surety bonds are financial agreements between a contractor (principal) and obligee (project owner) to complete a project. However, most large-scale projects involve multiple contractors, creating a network and introducing the possibility of incomplete obligations to propagate and result in project failures. Typical models for risk assessment assume independent failure probabilities within each contractor. However, we take a network approach, modeling the contractor network as a directed graph where nodes represent contractors and project owners and edges represent contractual obligations with associated financial records. To understand risk propagation throughout the contractor network, we extend the celebrated Friedkin-Johnsen model and introduce a stochastic process to simulate principal failures across the network. From a theoretical perspective, we show that under natural monotonicity conditions on the contractor network, incorporating network effects leads to increases in the average risk for the surety organization. We further use data from a partnering insurance company to validate our findings, estimating an approximately 2% higher exposure when accounting for network effects.

LGDec 21, 2025

Is Your Conditional Diffusion Model Actually Denoising?

Daniel Pfrommer, Zehao Dou, Christopher Scarvelis et al.

We study the inductive biases of diffusion models with a conditioning-variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized "denoising" process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g. DDPM, DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space and show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.

LGFeb 4, 2025

On the Emergence of Position Bias in Transformers

Xinyi Wu, Yifei Wang, Stefanie Jegelka et al.

Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers$\unicode{x2013}$coupled with the causal mask$\unicode{x2013}$leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.

LGMar 12, 2025

The Pitfalls of Imitation Learning when Actions are Continuous

Max Simchowitz, Daniel Pfrommer, Ali Jadbabaie

We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action control system. We show that, even if the dynamics satisfy a control-theoretic property called exponential stability (i.e. the effects of perturbations decay exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to any algorithm which learns solely from expert data, including both behavior cloning and offline-RL algorithms, unless the algorithm produces highly "improper" imitator policies--those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity--or unless the expert trajectory distribution is sufficiently "spread." We provide experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today's popular policy parameterizations in robot learning (e.g. action-chunking and diffusion policies). We also establish a host of complementary negative and positive results for imitation in control systems.

LGJan 30, 2025

On the Role of Transformer Feed-Forward Layers in Nonlinear In-Context Learning

Haoyuan Sun, Ali Jadbabaie, Navid Azizan · mit

Transformer-based models demonstrate a remarkable ability for in-context learning (ICL), where they can adapt to unseen tasks from a few prompt examples without parameter updates. Recent research has illuminated how Transformers perform ICL, showing that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent for linear least-squares objectives when trained on random linear regression tasks. Building on this, we investigate ICL for nonlinear function classes. We first prove that LSA is inherently incapable of outperforming linear predictors on nonlinear tasks, underscoring why prior solutions cannot readily extend to these problems. To overcome this limitation, we analyze a Transformer block consisting of LSA and feed-forward layers inspired by the gated linear units (GLU), which is a standard component of modern Transformers. We show that this block achieves nonlinear ICL by implementing one step of gradient descent on a polynomial kernel regression loss. Furthermore, our analysis reveals that the expressivity of a single block is inherently limited by its dimensions. We then show that a deep Transformer can overcome this bottleneck by distributing the computation of richer kernel functions across multiple blocks, performing block-coordinate descent in a high-dimensional feature space that a single block cannot represent. Our findings highlight that the feed-forward layers provide a crucial and scalable mechanism by which Transformers can express nonlinear representations for ICL.

LGJul 1, 2025

A Test-Function Approach to Incremental Stability

Daniel Pfrommer, Max Simchowitz, Ali Jadbabaie

This paper presents a novel framework for analyzing Incremental-Input-to-State Stability ($δ$ISS) based on the idea of using rewards as "test functions." Whereas control theory traditionally deals with Lyapunov functions that satisfy a time-decrease condition, reinforcement learning (RL) value functions are constructed by exponentially decaying a Lipschitz reward function that may be non-smooth and unbounded on both sides. Thus, these RL-style value functions cannot be directly understood as Lyapunov certificates. We develop a new equivalence between a variant of incremental input-to-state stability of a closed-loop system under given a policy, and the regularity of RL-style value functions under adversarial selection of a Hölder-continuous reward function. This result highlights that the regularity of value functions, and their connection to incremental stability, can be understood in a way that is distinct from the traditional Lyapunov-based approach to certifying stability in control theory.

DSFeb 13, 2025

Fast Tensor Completion via Approximate Richardson Iteration

Mehrdad Ghadiri, Matthew Fahrbach, Yunbum Kook et al.

We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods to solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is often lost in TC regression problems, making direct extensions unclear. This work proposes a novel lifting method for approximately solving TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We analyze the convergence rate of our approximate Richardson iteration-based algorithm, and our empirical study shows that it can be 100x faster than direct methods for CP completion on real-world tensors.

LGJun 5, 2024

Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

Michael Scholkemper, Xinyi Wu, Ali Jadbabaie et al.

Residual connections and normalization layers have become standard design choices for graph neural networks (GNNs), and were proposed as solutions to the mitigate the oversmoothing problem in GNNs. However, how exactly these methods help alleviate the oversmoothing problem from a theoretical perspective is not well understood. In this work, we provide a formal and precise characterization of (linearized) GNNs with residual connections and normalization layers. We establish that (a) for residual connections, the incorporation of the initial features at each layer can prevent the signal from becoming too smooth, and determines the subspace of possible node representations; (b) batch normalization prevents a complete collapse of the output embedding space to a one-dimensional subspace through the individual rescaling of each column of the feature matrix. This results in the convergence of node representations to the top-$k$ eigenspace of the message-passing operator; (c) moreover, we show that the centering step of a normalization layer -- which can be understood as a projection -- alters the graph signal in message-passing in such a way that relevant information can become harder to extract. We therefore introduce a novel, principled normalization layer called GraphNormv2 in which the centering step is learned such that it does not distort the original graph signal in an undesirable way. Experimental results confirm the effectiveness of our method.

OCApr 11, 2024

A least-square method for non-asymptotic identification in linear switching control

Haoyuan Sun, Ali Jadbabaie

The focus of this paper is on linear system identification in the setting where it is known that the underlying partially-observed linear dynamical system lies within a finite collection of known candidate models. We first consider the problem of identification from a given trajectory, which in this setting reduces to identifying the index of the true model with high probability. We characterize the finite-time sample complexity of this problem by leveraging recent advances in the non-asymptotic analysis of linear least-square methods in the literature. In comparison to the earlier results that assume no prior knowledge of the system, our approach takes advantage of the smaller hypothesis class and leads to the design of a learner with a dimension-free sample complexity bound. Next, we consider the switching control of linear systems, where there is a candidate controller for each of the candidate models and data is collected through interaction of the system with a collection of potentially destabilizing controllers. We develop a dimension-dependent criterion that can detect those destabilizing controllers in finite time. By leveraging these results, we propose a data-driven switching strategy that identifies the unknown parameters of the underlying system. We then provide a non-asymptotic analysis of its performance and discuss its implications on the classical method of estimator-based supervisory control.

LGMar 25, 2024

Belief Samples Are All You Need For Social Learning

Mahyar JafariNodeh, Amir Ajorlou, Ali Jadbabaie

In this paper, we consider the problem of social learning, where a group of agents embedded in a social network are interested in learning an underlying state of the world. Agents have incomplete, noisy, and heterogeneous sources of information, providing them with recurring private observations of the underlying state of the world. Agents can share their learning experience with their peers by taking actions observable to them, with values from a finite feasible set of states. Actions can be interpreted as samples from the beliefs which agents may form and update on what the true state of the world is. Sharing samples, in place of full beliefs, is motivated by the limited communication, cognitive, and information-processing resources available to agents especially in large populations. Previous work (Salhab et al.) poses the question as to whether learning with probability one is still achievable if agents are only allowed to communicate samples from their beliefs. We provide a definite positive answer to this question, assuming a strongly connected network and a ``collective distinguishability'' assumption, which are both required for learning even in full-belief-sharing settings. In our proposed belief update mechanism, each agent's belief is a normalized weighted geometric interpolation between a fully Bayesian private belief -- aggregating information from the private source -- and an ensemble of empirical distributions of the samples shared by her neighbors over time. By carefully constructing asymptotic almost-sure lower/upper bounds on the frequency of shared samples matching the true state/or not, we rigorously prove the convergence of all the beliefs to the true state, with probability one.

LGMay 25, 2023

Demystifying Oversmoothing in Attention-Based Graph Neural Networks

Xinyi Wu, Amir Ajorlou, Zihui Wu et al.

Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations. While previous work has established that Graph Convolutional Networks (GCNs) exponentially lose expressive power, it remains controversial whether the graph attention mechanism can mitigate oversmoothing. In this work, we provide a definitive answer to this question through a rigorous mathematical analysis, by viewing attention-based GNNs as nonlinear time-varying dynamical systems and incorporating tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius. We establish that, contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The proposed framework extends the existing results on oversmoothing for symmetric GCNs to a significantly broader class of GNN models, including random walk GCNs, Graph Attention Networks (GATs) and (graph) transformers. In particular, our analysis accounts for asymmetric, state-dependent and time-varying aggregation operators and a wide range of common nonlinear activation functions, such as ReLU, LeakyReLU, GELU and SiLU.

LGMay 25, 2023

How to escape sharp minima with random perturbations

Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra

Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.

MLJan 7, 2022

Unifying Epidemic Models with Mixtures

Arnab Sarker, Ali Jadbabaie, Devavrat Shah

The COVID-19 pandemic has emphasized the need for a robust understanding of epidemic models. Current models of epidemics are classified as either mechanistic or non-mechanistic: mechanistic models make explicit assumptions on the dynamics of disease, whereas non-mechanistic models make assumptions on the form of observed time series. Here, we introduce a simple mixture-based model which bridges the two approaches while retaining benefits of both. The model represents time series of cases and fatalities as a mixture of Gaussian curves, providing a flexible function class to learn from data compared to traditional mechanistic models. Although the model is non-mechanistic, we show that it arises as the natural outcome of a stochastic process based on a networked SIR framework. This allows learned parameters to take on a more meaningful interpretation compared to similar non-mechanistic models, and we validate the interpretations using auxiliary mobility data collected during the COVID-19 pandemic. We provide a simple learning algorithm to identify model parameters and establish theoretical results which show the model can be efficiently learned from data. Empirically, we find the model to have low prediction error. The model is available live at covidpredictions.mit.edu. Ultimately, this allows us to systematically understand the impacts of interventions on COVID-19, which is critical in developing data-driven solutions to controlling epidemics.

LGJan 6, 2022

Federated Optimization of Smooth Loss Functions

Ali Jadbabaie, Anuran Makur, Devavrat Shah

In this work, we study empirical risk minimization (ERM) within a federated learning framework, where a central server minimizes an ERM objective function using training data that is stored across $m$ clients. In this setting, the Federated Averaging (FedAve) algorithm is the staple for determining $ε$-approximate solutions to the ERM problem. Similar to standard optimization algorithms, the convergence analysis of FedAve only relies on smoothness of the loss function in the optimization parameter. However, loss functions are often very smooth in the training data too. To exploit this additional smoothness, we propose the Federated Low Rank Gradient Descent (FedLRGD) algorithm. Since smoothness in data induces an approximate low rank structure on the loss function, our method first performs a few rounds of communication between the server and clients to learn weights that the server can use to approximate clients' gradients. Then, our method solves the ERM problem at the server using inexact gradient descent. To show that FedLRGD can have superior performance to FedAve, we present a notion of federated oracle complexity as a counterpart to canonical oracle complexity. Under some assumptions on the loss function, e.g., strong convexity in parameter, $η$-Hölder smoothness in data, etc., we prove that the federated oracle complexity of FedLRGD scales like $φm(p/ε)^{Θ(d/η)}$ and that of FedAve scales like $φm(p/ε)^{3/4}$ (neglecting sub-dominant factors), where $φ\gg 1$ is a "communication-to-computation ratio," $p$ is the parameter dimension, and $d$ is the data dimension. Then, we show that when $d$ is small and the loss function is sufficiently smooth in the data, FedLRGD beats FedAve in federated oracle complexity. Finally, in the course of analyzing FedLRGD, we also establish a result on low rank approximation of latent variable models.