SYApr 30, 2018
Nash and Wardrop equilibria in aggregative games with coupling constraintsDario Paccagnan, Basilio Gentile, Francesca Parise et al.
We consider the framework of aggregative games, in which the cost function of each agent depends on his own strategy and on the average population strategy. As first contribution, we investigate the relations between the concepts of Nash and Wardrop equilibrium. By exploiting a characterization of the two equilibria as solutions of variational inequalities, we bound their distance with a decreasing function of the population size. As second contribution, we propose two decentralized algorithms that converge to such equilibria and are capable of coping with constraints coupling the strategies of different agents. Finally, we study the applications of charging of electric vehicles and of route choice on a road network.
OCJun 7, 2016
Robust Optimal Control with Adjustable Uncertainty SetsXiaojing Zhang, Maryam Kamgarpour, Angelos Georghiou et al.
In this paper, we develop a unified framework for studying constrained robust optimal control problems with adjustable uncertainty sets. In contrast to standard constrained robust optimal control problems with known uncertainty sets, we treat the uncertainty sets in our problems as additional decision variables. In particular, given a finite prediction horizon and a metric for adjusting the uncertainty sets, we address the question of determining the optimal size and shape of the uncertainty sets, while simultaneously ensuring the existence of a control policy that will keep the system within its constraints for all possible disturbance realizations inside the adjusted uncertainty set. Since our problem subsumes the classical constrained robust optimal control design problem, it is computationally intractable in general. We demonstrate that by restricting the families of admissible uncertainty sets and control policies, the problem can be formulated as a tractable convex optimization problem. We show that our framework captures several families of (convex) uncertainty sets of practical interest, and illustrate our approach on a demand response problem of providing control reserves for a power system.
GTMay 26
On characterization and existence of constrained correlated equilibria in Markov gamesTingting Ni, Anna Maddux, Maryam Kamgarpour
Markov games with coupling constraints model constrained dynamical decision-making involving self-interested agents, where the feasibility of an individual agent's strategy depends on the joint strategies of the others. Such games arise in numerous real-world applications involving safety requirements and budget caps, for example, in environmental management, electricity markets, and transportation systems. In unconstrained dynamical decision-making, the correlated equilibrium has emerged as a desired solution concept due to its computational tractability and amenability to learning algorithms. Understanding how coupling constraints shape correlated equilibria is a crucial step towards computing solutions in constrained Markov games. In this paper, we formalize and characterize the notion of constrained correlated equilibria for Markov games, defined as feasible joint policies where any unilateral deviation is either unprofitable or infeasible. Building on this characterization, we further study existence conditions for constrained correlated equilibria. In particular, we provide a novel existence proof of such equilibria in Markov games with coupling constraints.
OCDec 6, 2012
Approximate Dynamic Programming via Sum of Squares ProgrammingTyler H. Summers, Konstantin Kunz, Nikolaos Kariotoglou et al.
We describe an approximate dynamic programming method for stochastic control problems on infinite state and input spaces. The optimal value function is approximated by a linear combination of basis functions with coefficients as decision variables. By relaxing the Bellman equation to an inequality, one obtains a linear program in the basis coefficients with an infinite set of constraints. We show that a recently introduced method, which obtains convex quadratic value function approximations, can be extended to higher order polynomial approximations via sum of squares programming techniques. An approximate value function can then be computed offline by solving a semidefinite program, without having to sample the infinite constraint. The policy is evaluated online by solving a polynomial optimization problem, which also turns out to be convex in some cases. We experimentally validate the method on an autonomous helicopter testbed using a 10-dimensional helicopter model.
SYJul 11, 2020
An Input-Output Parametrization of Stabilizing Controllers: amidst Youla and System Level SynthesisLuca Furieri, Yang Zheng, Antonis Papachristodoulou et al.
This paper proposes a novel input-output parametrization of the set of internally stabilizing output-feedback controllers for linear time-invariant (LTI) systems. Our underlying idea is to directly treat the closed-loop transfer matrices from disturbances to input and output signals as design parameters and exploit their affine relationships. This input-output perspective is particularly effective when a doubly-coprime factorization is difficult to compute, or an initial stabilizing controller is challenging to find; most previous work requires one of these pre-computation steps. Instead, our approach can bypass such pre-computations, in the sense that a stabilizing controller is computed by directly solving a linear program (LP). Furthermore, we show that the proposed input-output parametrization allows for computing norm-optimal controllers subject to quadratically invariant (QI) constraints using convex programming.
SYMay 1, 2017
Sequential Linear Quadratic Optimal Control for Nonlinear Switched SystemsFarbod Farshidian, Maryam Kamgarpour, Diego Pardo et al.
In this contribution, we introduce an efficient method for solving the optimal control problem for an unconstrained nonlinear switched system with an arbitrary cost function. We assume that the sequence of the switching modes are given but the switching time in between consecutive modes remains to be optimized. The proposed method uses a two-stage approach as introduced by Xu and Antsaklis (2004) where the original optimal control problem is transcribed into an equivalent problem parametrized by the switching times and the optimal control policy is obtained based on the solution of a two-point boundary value differential equation. The main contribution of this paper is to use a Sequential Linear Quadratic approach to synthesize the optimal controller instead of solving a boundary value problem. The proposed method is numerically more efficient and scales very well to the high dimensional problems. In order to evaluate its performance, we use two numerical examples as benchmarks to compare against the baseline algorithm. In the third numerical example, we apply the proposed algorithm to the Center of Mass control problem in a quadruped robot locomotion task.
LGMar 14, 2022
Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium ComputationPier Giuseppe Sessa, Maryam Kamgarpour, Andreas Krause
We consider model-based multi-agent reinforcement learning, where the environment transition model is unknown and can only be learned via expensive interactions with the environment. We propose H-MARL (Hallucinated Multi-Agent Reinforcement Learning), a novel sample-efficient algorithm that can efficiently balance exploration, i.e., learning about the environment, and exploitation, i.e., achieve good equilibrium performance in the underlying general-sum Markov game. H-MARL builds high-probability confidence intervals around the unknown transition model and sequentially updates them based on newly observed data. Using these, it constructs an optimistic hallucinated game for the agents for which equilibrium policies are computed at each round. We consider general statistical models (e.g., Gaussian processes, deep ensembles, etc.) and policy classes (e.g., deep neural networks), and theoretically analyze our approach by bounding the agents' dynamic regret. Moreover, we provide a convergence rate to the equilibria of the underlying Markov game. We demonstrate our approach experimentally on an autonomous driving simulation benchmark. H-MARL learns successful equilibrium policies after a few interactions with the environment and can significantly improve the performance compared to non-optimistic exploration methods.
OCOct 6, 2017
The Linear Programming Approach to Reach-Avoid Problems for Markov Decision ProcessesNikolaos Kariotoglou, Maryam Kamgarpour, Tyler Summers et al.
One of the most fundamental problems in Markov decision processes is analysis and control synthesis for safety and reachability specifications. We consider the stochastic reach-avoid problem, in which the objective is to synthesize a control policy to maximize the probability of reaching a target set at a given time, while staying in a safe set at all prior times. We characterize the solution to this problem through an infinite dimensional linear program. We then develop a tractable approximation to the infinite dimensional linear program through finite dimensional approximations of the decision space and constraints. For a large class of Markov decision processes modeled by Gaussian mixtures kernels we show that through a proper selection of the finite dimensional space, one can further reduce the computational complexity of the resulting linear program. We validate the proposed method and analyze its potential with a series of numerical case studies.
SYMar 11, 2019
On Separable Quadratic Lyapunov Functions for Convex Design of Distributed ControllersLuca Furieri, Yang Zheng, Antonis Papachristodoulou et al.
We consider the problem of designing a stabilizing and optimal static controller with a pre-specified sparsity pattern. Since this problem is NP-hard in general, it is necessary to resort to approximation approaches. In this paper, we characterize a class of convex restrictions of this problem that are based on designing a separable quadratic Lyapunov function for the closed-loop system. This approach generalizes previous results based on optimizing over diagonal Lyapunov functions, thus allowing for improved feasibility and performance. Moreover, we suggest a simple procedure to compute favourable structures for the Lyapunov function yielding high-performance distributed controllers. Numerical examples validate our results.
OCJul 21, 2022
Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement LearningIlnura Usmanova, Yarden As, Maryam Kamgarpour et al.
Optimizing noisy functions online, when evaluating the objective requires experiments on a deployed system, is a crucial task arising in manufacturing, robotics and many others. Often, constraints on safe inputs are unknown ahead of time, and we only obtain noisy information, indicating how close we are to violating the constraints. Yet, safety must be guaranteed at all times, not only for the final output of the algorithm. We introduce a general approach for seeking a stationary point in high dimensional non-linear stochastic optimization problems in which maintaining safety during learning is crucial. Our approach called LB-SGD is based on applying stochastic gradient descent (SGD) with a carefully chosen adaptive step size to a logarithmic barrier approximation of the original problem. We provide a complete convergence analysis of non-convex, convex, and strongly-convex smooth constrained problems, with first-order and zeroth-order feedback. Our approach yields efficient updates and scales better with dimensionality compared to existing approaches. We empirically compare the sample complexity and the computational cost of our method with existing safe learning approaches. Beyond synthetic benchmarks, we demonstrate the effectiveness of our approach on minimizing constraint violation in policy search tasks in safe reinforcement learning (RL).
OCNov 19, 2018
Performance guarantees for greedy maximization of non-submodular controllability metricsTyler Summers, Maryam Kamgarpour
A key problem in emerging complex cyber-physical networks is the design of information and control topologies, including sensor and actuator selection and communication network design. These problems can be posed as combinatorial set function optimization problems to maximize a dynamic performance metric for the network. Some systems and control metrics feature a property called submodularity, which allows simple greedy algorithms to obtain provably near-optimal topology designs. However, many important metrics lack submodularity and therefore lack provable guarantees for using a greedy optimization approach. Here we show that performance guarantees can be obtained for greedy maximization of certain non-submodular functions of the controllability and observability Gramians. Our results are based on two key quantities: the submodularity ratio, which quantifies how far a set function is from being submodular, and the curvature, which quantifies how far a set function is from being supermodular.
LGJun 1, 2023
Identifiability and Generalizability in Constrained Inverse Reinforcement LearningAndreas Schlaginhaufen, Maryam Kamgarpour
Two main challenges in Reinforcement Learning (RL) are designing appropriate reward functions and ensuring the safety of the learned policy. To address these challenges, we present a theoretical framework for Inverse Reinforcement Learning (IRL) in constrained Markov decision processes. From a convex-analytic perspective, we extend prior results on reward identifiability and generalizability to both the constrained setting and a more general class of regularizations. In particular, we show that identifiability up to potential shaping (Cao et al., 2021) is a consequence of entropy regularization and may generally no longer hold for other regularizations or in the presence of safety constraints. We also show that to ensure generalizability to new transition laws and constraints, the true reward must be identified up to a constant. Additionally, we derive a finite sample guarantee for the suboptimality of the learned rewards, and validate our results in a gridworld environment.
OCMar 19, 2019
Actuator Placement for Optimizing Network Performance under Controllability ConstraintsBaiwei Guo, Orcun Karaca, Tyler Summers et al.
With the rising importance of large-scale network control, the problem of actuator placement has received increasing attention. Our goal in this paper is to find a set of actuators minimizing the metric that measures the average energy consumption of the control inputs while ensuring structural controllability of the network. As this problem is intractable, greedy algorithm can be used to obtain an approximate solution. To provide a performance guarantee for this approach, we first define the submodularity ratio for the metric under consideration and then reformulate the structural controllability constraint as a matroid constraint. This shows that the problem under study can be characterized by a matroid optimization involving a weakly submodular objective function. Then, we derive a novel performance guarantee for the greedy algorithm applied to this class of optimization problems. Finally, we show that the matroid feasibility check for the greedy algorithm can be cast as a maximum matching problem in a certain auxiliary bipartite graph related to the network graph.
SYJan 10, 2021
Using Uncertainty Data in Chance-Constrained Trajectory PlanningVasileios Lefkopoulos, Maryam Kamgarpour
We consider the problem of trajectory planning in an environment comprised of a set of obstacles with uncertain locations. While previous approaches model the uncertainties with a prescribed Gaussian distribution, we consider the realistic case in which the distribution's moments are unknown and are learned online. We derive tight concentration bounds on the error of the estimated moments. These bounds are then used to derive a tractable and tight mixed-integer convex reformulation of the trajectory planning problem, assuming linear dynamics and polyhedral constraints. The solution of the resulting optimization program is a feasible solution for the original problem with high confidence. We illustrate the approach with a case study from autonomous driving.
SYMar 9, 2019
Unified Approach to Convex Robust Distributed Control given Arbitrary Information StructuresLuca Furieri, Maryam Kamgarpour
We consider the problem of computing optimal linear control policies for linear systems in finite-horizon. The states and the inputs are required to remain inside pre-specified safety sets at all times despite unknown disturbances. In this technical note, we focus on the requirement that the control policy is distributed, in the sense that it can only be based on partial information about the history of the outputs. It is well-known that when a condition denoted as Quadratic Invariance (QI) holds, the optimal distributed control policy can be computed in a tractable way. Our goal is to unify and generalize the class of information structures over which quadratic invariance is equivalent to a test over finitely many binary matrices. The test we propose certifies convexity of the output-feedback distributed control problem in finite-horizon given any arbitrarily defined information structure, including the case of time varying communication networks and forgetting mechanisms. Furthermore, the framework we consider allows for including polytopic constraints on the states and the inputs in a natural way, without affecting convexity.
SYMar 9, 2019
Robust Distributed Control Beyond Quadratic InvarianceLuca Furieri, Maryam Kamgarpour
The problem of robust distributed control arises in several large-scale systems, such as transportation networks and power grid systems. In many practical scenarios controllers might not have enough information to make globally optimal decisions in a tractable way. We propose a novel class of tractable optimization problems whose solution is a controller complying with any specified information structure. The approach we suggest is based on decomposing intractable information constraints into two subspace constraints in the disturbance feedback domain. We discuss how to perform the decomposition in an optimized way. The resulting control policy is globally optimal when a condition known as Quadratic Invariance (QI) holds, whereas it is feasible and it provides a provable upper bound on the minimum cost when QI does not hold. Finally, we show that our method can lead to improved performance guarantees with respect to previous approaches, by applying the developed techniques to the platooning of autonomous vehicles.
LGMay 14
Fast Rates for Inverse Reinforcement LearningAndreas Schlaginhaufen, Maryam Kamgarpour
We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
GTMar 18
Eliciting Truthful Feedback for Preference-Based Learning via the VCG MechanismLeo Landolt, Anna Maddux, Andreas Schlaginhaufen et al.
We study resource allocation problems in which a central planner allocates resources among strategic agents with private cost functions in order to minimize a social cost, defined as an aggregate of the agents' costs. This setting poses two main challenges: (i) the agents' cost functions may be unknown to them or difficult to specify explicitly, and (ii) agents may misreport their costs strategically. To address these challenges, we propose an algorithm that combines preference-based learning with Vickrey-Clarke-Groves (VCG) payments to incentivize truthful reporting. Our algorithm selects informative preference queries via D-optimal design, estimates cost parameters through maximum likelihood, and computes VCG allocations and payments based on these estimates. In a one-shot setting, we prove that the mechanism is approximately truthful, individually rational, and efficient up to an error of $\tilde{\mathcal O}(K^{-1/2})$ for $K$ preference queries per agent. In an online setting, these guarantees hold asymptotically with sublinear regret at a rate of $\tilde{\mathcal O}(T^{2/3})$ after $T$ rounds. Finally, we validate our approach through a numerical case study on demand response in local electricity markets.
LGJan 29
Constrained Meta Reinforcement Learning with Provable Test-Time SafetyTingting Ni, Maryam Kamgarpour
Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task, while reducing the sample complexity and thus, enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.
GTMay 7
Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled DynamicsPhilip Jordan, Maryam Kamgarpour
We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.
LGApr 1
Model-Based Learning of Near-Optimal Finite-Window Policies in POMDPsPhilip Jordan, Maryam Kamgarpour
We study model-based learning of finite-window policies in tabular partially observable Markov decision processes (POMDPs). A common approach to learning under partial observability is to approximate unbounded history dependencies using finite action-observation windows. This induces a finite-state Markov decision process (MDP) over histories, referred to as the superstate MDP. Once a model of this superstate MDP is available, standard MDP algorithms can be used to compute optimal policies, motivating the need for sample-efficient model estimation. Estimating the superstate MDP model is challenging because trajectories are generated by interaction with the original POMDP, creating a mismatch between the sampling process and target model. We propose a model estimation procedure for tabular POMDPs and analyze its sample complexity. Our analysis exploits a connection between filter stability and concentration inequalities for weakly dependent random variables. As a result, we obtain tight sample complexity guarantees for estimating the superstate MDP model from a single trajectory. Combined with value iteration, this yields approximately optimal finite-window policies for the POMDP.
GTDec 13, 2023
Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time AssumptionsReda Ouhamma, Maryam Kamgarpour
We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions nor sharing information. Prior works established finite-time convergence to an approximate Nash equilibrium under strong reachability and mixing time assumptions. We propose a convergent algorithm that significantly relaxes these assumptions, requiring only the existence of a single policy (not necessarily known) with bounded reachability and mixing time. Our key technical novelty is introducing Tsallis entropy regularization to smooth the best-response policy updates. By suitably tuning this regularization, we ensure sufficient exploration, thus bypassing previous stringent assumptions on the MDP. By establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer, we prove finite-time convergence to an approximate Nash equilibrium.
LGJun 11, 2025
Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental DesignAndreas Schlaginhaufen, Reda Ouhamma, Maryam Kamgarpour
We study reinforcement learning from human feedback in general Markov decision processes, where agents learn from trajectory-level preference comparisons. A central challenge in this setting is to design algorithms that select informative preference queries to identify the underlying reward while ensuring theoretical guarantees. We propose a meta-algorithm based on randomized exploration, which avoids the computational challenges associated with optimistic approaches and remains tractable. We establish both regret and last-iterate guarantees under mild reinforcement learning oracle assumptions. To improve query complexity, we introduce and analyze an improved algorithm that collects batches of trajectory pairs and applies optimal experimental design to select informative comparison queries. The batch structure also enables parallelization of preference queries, which is relevant in practical deployment as feedback can be gathered concurrently. Empirical evaluation confirms that the proposed method is competitive with reward-based reinforcement learning while requiring a small number of preference queries.
OCDec 21, 2024
A learning-based approach to stochastic optimal control under reach-avoid constraintTingting Ni, Maryam Kamgarpour
We develop a model-free approach to optimally control stochastic, Markovian systems subject to a reach-avoid constraint. Specifically, the state trajectory must remain within a safe set while reaching a target set within a finite time horizon. Due to the time-dependent nature of these constraints, we show that, in general, the optimal policy for this constrained stochastic control problem is non-Markovian, which increases the computational complexity. To address this challenge, we apply the state-augmentation technique from arXiv:2402.19360, reformulating the problem as a constrained Markov decision process (CMDP) on an extended state space. This transformation allows us to search for a Markovian policy, avoiding the complexity of non-Markovian policies. To learn the optimal policy without a system model, and using only trajectory data, we develop a log-barrier policy gradient approach. We prove that under suitable assumptions, the policy parameters converge to the optimal parameters, while ensuring that the system trajectories satisfy the stochastic reach-avoid constraint with high probability.
LGMar 25, 2024
Convergence of a model-free entropy-regularized inverse reinforcement learning algorithmTitouan Renard, Andreas Schlaginhaufen, Tingting Ni et al.
Given a dataset of expert demonstrations, inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal. This work proposes a model-free algorithm to solve entropy-regularized IRL problem. In particular, we employ a stochastic gradient descent update for the reward and a stochastic soft policy iteration update for the policy. Assuming access to a generative model, we prove that our algorithm is guaranteed to recover a reward for which the expert is $\varepsilon$-optimal using $\mathcal{O}(1/\varepsilon^{2})$ samples of the Markov decision process (MDP). Furthermore, with $\mathcal{O}(1/\varepsilon^{4})$ samples we prove that the optimal policy corresponding to the recovered reward is $\varepsilon$-close to the expert policy in total variation distance.
SYMar 24
Uncertainty and Autarky: Cooperative Game Theory for Stable Local Energy Market PartitioningSaurabh Vaishampayan, Maryam Kamgarpour
Local energy markets empower prosumers to form coalitions for energy trading. However, the optimal partitioning of the distribution grid into such coalitions remains unclear, especially in constrained grids with stochastic production and consumption. This analysis must take into account the interests of both the grid operator and the constituent prosumers. In this work, we present a cooperative game theoretic framework to study distribution grid partitioning into local energy market coalitions under uncertain prosumption and grid constraints. We formulate the optimal stable partitioning problem to balance the interests of the grid operator with that of prosumers. Under deterministic load and generation, we show that the largest market coalition is the optimal stable partition. For the case of stochastic loads and generation, we provide an algorithm to evaluate the optimal stable partition. Numerical experiments are performed on benchmark and real world distribution grids. Our results help in understanding how uncertainty affects local energy market partitioning decisions in constrained distribution grids.
GTSep 17, 2025
Nash Equilibria in Games with Playerwise Concave Coupling Constraints: Existence and ComputationPhilip Jordan, Maryam Kamgarpour
We study the existence and computation of Nash equilibria in continuous static games where the players' admissible strategies are subject to shared coupling constraints, i.e., constraints that depend on their \emph{joint} strategies. Specifically, we focus on a class of games characterized by playerwise concave utilities and playerwise concave constraints. Prior results on the existence of Nash equilibria are not applicable to this class, as they rely on strong assumptions such as joint convexity of the feasible set. By leveraging topological fixed point theory and novel structural insights into the contractibility of feasible sets under playerwise concave constraints, we give an existence proof for Nash equilibria under weaker conditions. Having established existence, we then focus on the computation of Nash equilibria via independent gradient methods under the additional assumption that the utilities admit a potential function. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an $ε$-approximate constrained Nash equilibrium within $\mathcal{O}(ε^{-3})$ iterations.
LGJun 3, 2024
Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement LearningAndreas Schlaginhaufen, Maryam Kamgarpour
Inverse reinforcement learning (IRL) aims to infer a reward from expert demonstrations, motivated by the idea that the reward, rather than the policy, is the most succinct and transferable description of a task [Ng et al., 2000]. However, the reward corresponding to an optimal policy is not unique, making it unclear if an IRL-learned reward is transferable to new transition laws in the sense that its optimal policy aligns with the optimal policy corresponding to the expert's true reward. Past work has addressed this problem only under the assumption of full access to the expert's policy, guaranteeing transferability when learning from two experts with the same reward but different transition laws that satisfy a specific rank condition [Rolland et al., 2022]. In this work, we show that the conditions developed under full access to the expert's policy cannot guarantee transferability in the more practical scenario where we have access only to demonstrations of the expert. Instead of a binary rank condition, we propose principal angles as a more refined measure of similarity and dissimilarity between transition laws. Based on this, we then establish two key results: 1) a sufficient condition for transferability to any transition laws when learning from at least two experts with sufficiently different transition laws, and 2) a sufficient condition for transferability to local changes in the transition law when learning from a single expert. Furthermore, we also provide a probably approximately correct (PAC) algorithm and an end-to-end analysis for learning transferable rewards from demonstrations of multiple experts.
SYAug 5, 2021
Safe Motion Planning against Multimodal Distributions based on a Scenario ApproachHeejin Ahn, Colin Chen, Ian M. Mitchell et al.
We present the design of a motion planning algorithm that ensures safety for an autonomous vehicle. In particular, we consider a multimodal distribution over uncertainties; for example, the uncertain predictions of future trajectories of surrounding vehicles reflect discrete decisions, such as turning or going straight at intersections. We develop a computationally efficient, scenario-based approach that solves the motion planning problem with high confidence given a quantifiable number of samples from the multimodal distribution. Our approach is based on two preprocessing steps, which 1) separate the samples into distinct clusters and 2) compute a bounding polytope for each cluster. Then, we rewrite the motion planning problem approximately as a mixed-integer problem using the polytopes. We demonstrate via simulation on the nuScenes dataset that our approach ensures safety with high probability in the presence of multimodal uncertainties, and is computationally more efficient and less conservative than a conventional scenario approach.
GTJul 13, 2021
Contextual Games: Multi-Agent Learning with Side InformationPier Giuseppe Sessa, Ilija Bogunovic, Andreas Krause et al.
We formulate the novel class of contextual games, a type of repeated games driven by contextual information at each round. By means of kernel-based regularity assumptions, we model the correlation between different contexts and game outcomes and propose a novel online (meta) algorithm that exploits such correlations to minimize the contextual regret of individual players. We define game-theoretic notions of contextual Coarse Correlated Equilibria (c-CCE) and optimal contextual welfare for this new class of games and show that c-CCEs and optimal welfare can be approached whenever players' contextual regrets vanish. Finally, we empirically validate our results in a traffic routing experiment, where our algorithm leads to better performance and higher welfare compared to baselines that do not exploit the available contextual information or the correlations present in the game.
ROMar 2, 2021
Multi-robot task allocation for safe planning against stochastic hazard dynamicsDaniel Tihanyi, Yimeng Lu, Orcun Karaca et al.
We address multi-robot safe mission planning in uncertain dynamic environments. This problem arises in several applications including safety-critical exploration, surveillance, and emergency rescue missions. Computation of a multi-robot optimal control policy is challenging not only because of the complexity of incorporating dynamic uncertainties while planning, but also because of the exponential growth in problem size as a function of number of robots. Leveraging recent works obtaining a tractable safety maximizing plan for a single robot, we propose a scalable two-stage framework to solve the problem at hand. Specifically, the problem is split into a low-level single-agent control problem and a high-level task allocation problem. The low-level problem uses an efficient approximation of stochastic reachability for a Markov decision process to derive the optimal control policy under dynamic uncertainty. The task allocation is solved using polynomial-time forward and reverse greedy heuristics and in a distributed auction-based manner. By leveraging the properties of our safety objective function, we provide provable performance bounds on the safety of the approximate solutions proposed by these two heuristics. We evaluate the theory with extensive numerical case studies.
LGJul 10, 2020
Learning to Play Sequential Games versus Unknown OpponentsPier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour et al.
We consider a repeated sequential game between a learner, who plays first, and an opponent who responds to the chosen action. We seek to design strategies for the learner to successfully interact with the opponent. While most previous approaches consider known opponent models, we focus on the setting in which the opponent's model is unknown. To this end, we use kernel-based regularity assumptions to capture and exploit the structure in the opponent's response. We propose a novel algorithm for the learner when playing against an adversarial sequence of opponents. The algorithm combines ideas from bilevel optimization and online learning to effectively balance between exploration (learning about the opponent's model) and exploitation (selecting highly rewarding actions for the learner). Our results include algorithm's regret guarantees that depend on the regularity of the opponent's response and scale sublinearly with the number of game rounds. Moreover, we specialize our approach to repeated Stackelberg games, and empirically demonstrate its effectiveness in a traffic routing and wildlife conservation task
ROMar 5, 2020
Safe Mission Planning under Dynamical UncertaintiesYimeng Lu, Maryam Kamgarpour
This paper considers safe robot mission planning in uncertain dynamical environments. This problem arises in applications such as surveillance, emergency rescue, and autonomous driving. It is a challenging problem due to modeling and integrating dynamical uncertainties into a safe planning framework, and finding a solution in a computationally tractable way. In this work, we first develop a probabilistic model for dynamical uncertainties. Then, we provide a framework to generate a path that maximizes safety for complex missions by incorporating the uncertainty model. We also devise a Monte Carlo method to obtain a safe path efficiently. Finally, we evaluate the performance of our approach and compare it to potential alternatives in several case studies.
LGFeb 28, 2020
Mixed Strategies for Robust Optimization of Unknown ObjectivesPier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour et al.
We consider robust optimization problems, where the goal is to optimize an unknown objective function against the worst-case realization of an uncertain parameter. For this setting, we design a novel sample-efficient algorithm GP-MRO, which sequentially learns about the unknown objective from noisy point evaluations. GP-MRO seeks to discover a robust and randomized mixed strategy, that maximizes the worst-case expected objective value. To achieve this, it combines techniques from online learning with nonparametric confidence bounds from Gaussian processes. Our theoretical results characterize the number of samples required by GP-MRO to discover a robust near-optimal mixed strategy for different GP kernels of interest. We experimentally demonstrate the performance of our algorithm on synthetic datasets and on human-assisted trajectory planning tasks for autonomous vehicles. In our simulations, we show that robust deterministic strategies can be overly conservative, while the mixed strategies found by GP-MRO significantly improve the overall performance.
LGSep 18, 2019
No-Regret Learning in Unknown Games with Correlated PayoffsPier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour et al.
We consider the problem of learning to play a repeated multi-agent game with an unknown reward function. Single player online learning algorithms attain strong regret bounds when provided with full information feedback, which unfortunately is unavailable in many real-world scenarios. Bandit feedback alone, i.e., observing outcomes only for the selected action, yields substantially worse performance. In this paper, we consider a natural model where, besides a noisy measurement of the obtained reward, the player can also observe the opponents' actions. This feedback model, together with a regularity assumption on the reward function, allows us to exploit the correlations among different game outcomes by means of Gaussian processes (GPs). We propose a novel confidence-bound based bandit algorithm GP-MW, which utilizes the GP model for the reward function and runs a multiplicative weight (MW) method. We obtain novel kernel-dependent regret bounds that are comparable to the known bounds in the full information setting, while substantially improving upon the existing bandit results. We experimentally demonstrate the effectiveness of GP-MW in random matrix games, as well as real-world problems of traffic routing and movie recommendation. In our experiments, GP-MW consistently outperforms several baselines, while its performance is often comparable to methods that have access to full information feedback.
OCJun 13, 2018
Minimizing Regret of Bandit Online Optimization in Unconstrained Action SpacesTatiana Tatarenko, Maryam Kamgarpour
We consider online convex optimization with a zero-order oracle feedback. In particular, the decision maker does not know the explicit representation of the time-varying cost functions, or their gradients. At each time step, she observes the value of the corresponding cost function evaluated at her chosen action (zero-order oracle). The objective is to minimize the regret, that is, the difference between the sum of the costs she accumulates and that of a static optimal action had she known the sequence of cost functions a priori. We present a novel algorithm to minimize regret in unconstrained action spaces. Our algorithm hinges on a classical idea of one-point estimation of the gradients of the cost functions based on their observed values. The algorithm is independent of problem parameters. Letting $T$ denote the number of queries of the zero-order oracle and $n$ the problem dimension, the regret rate achieved is $O(n^{2/3}T^{2/3})$. Moreover, we adapt the presented algorithm to the setting with two-point feedback and demonstrate that the adapted procedure achieves the theoretical lower bound on the regret of $(n^{1/2}T^{1/2})$.
OCJun 17, 2017
Information Structure Design in Team Decision ProblemsTyler Summers, Changyuan Li, Maryam Kamgarpour
We consider a problem of information structure design in team decision problems and team games. We propose simple, scalable greedy algorithms for adding a set of extra information links to optimize team performance and resilience to non-cooperative and adversarial agents. We show via a simple counterexample that the set function mapping additional information links to team performance is in general not supermodular. Although this implies that the greedy algorithm is not accompanied by worst-case performance guarantees, we illustrate through numerical experiments that it can produce effective and often optimal or near optimal information structure modifications.