Chinmay Maheshwari

h-index5

13papers

125citations

Novelty60%

AI Score52

Ranked #15,649 of 194,257 authors (top 8%)#40 in OC (top 5%)

13 Papers

6.9LGOct 21, 2022

Competing Bandits in Time Varying Matching Markets

Deepan Muthirayan, Chinmay Maheshwari, Pramod P. Khargonekar et al.

We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has fixed known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study the problem of how to learn when the preferences of the players are time varying and unknown. Our contribution is a methodology that can handle any type of preference structure and variation scenario. We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of {$\widetilde{\mathcal{O}}(L^{1/2}_TT^{1/2})$} up to the number of changes in the underlying preferences of the agents, $L_T$. Therefore, we show that the optimal rates for single-agent learning can be achieved in spite of the competition up to a difference of a constant factor. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.

2.0OCSep 3, 2020

Stabilization under round robin scheduling of control inputs in nonlinear systems

Chinmay Maheshwari, Sukumar Srikant, Debasish Chatterjee

We study stability of multivariable control-affine nonlinear systems under sparsification of feedback controllers. Sparsification in our context refers to the scheduling of the individual control inputs one at a time in rapid periodic sweeps over the set of control inputs, which corresponds to round-robin scheduling. We prove that if a locally asymptotically stabilizing feedback controller is sparsified via the round-robin scheme and each control action is scaled appropriately, then the corresponding equilibrium of the resulting system is stabilized when the scheduling is sufficiently fast; under mild additional conditions, local asymptotic stabilization of the corresponding equilibrium can also be guaranteed. Moreover, the basin of attraction for the equilibrium of scheduled system also remains same as the original system under sufficiently fast switching. Our technical tools are derived from optimal control theory, and our results also contribute to the literature on the stability of switched systems in the fast switching regime. Illustrative numerical examples depicting several subtle features of our results are included.

16.1LGMay 29, 2022

Independent and Decentralized Learning in Markov Potential Games

Chinmay Maheshwari, Manxi Wu, Druv Pai et al.

We study a multi-agent reinforcement learning dynamics, and analyze its asymptotic behavior in infinite-horizon discounted Markov potential games. We focus on the independent and decentralized setting, where players do not know the game parameters, and cannot communicate or coordinate. In each stage, players update their estimate of Q-function that evaluates their total contingent payoff based on the realized one-stage reward in an asynchronous manner. Then, players independently update their policies by incorporating an optimal one-stage deviation strategy based on the estimated Q-function. Inspired by the actor-critic algorithm in single-agent reinforcement learning, a key feature of our learning dynamics is that agents update their Q-function estimates at a faster timescale than the policies. Leveraging tools from two-timescale asynchronous stochastic approximation theory, we characterize the convergent set of learning dynamics.

16.0AIJun 6, 2022

Decentralized, Communication- and Coordination-free Learning in Structured Matching Markets

Chinmay Maheshwari, Eric Mazumdar, Shankar Sastry

We study the problem of online learning in competitive settings in the context of two-sided matching markets. In particular, one side of the market, the agents, must learn about their preferences over the other side, the firms, through repeated interaction while competing with other agents for successful matches. We propose a class of decentralized, communication- and coordination-free algorithms that agents can use to reach to their stable match in structured matching markets. In contrast to prior works, the proposed algorithms make decisions based solely on an agent's own history of play and requires no foreknowledge of the firms' preferences. Our algorithms are constructed by splitting up the statistical problem of learning one's preferences, from noisy observations, from the problem of competing for firms. We show that under realistic structural assumptions on the underlying preferences of the agents and firms, the proposed algorithms incur a regret which grows at most logarithmically in the time horizon. Our results show that, in the case of matching markets, competition need not drastically affect the performance of decentralized, communication and coordination free online learning algorithms.

1.2SYOct 3, 2019

On optimal multiplexing of an ensemble of discrete-time constrained control systems on matrix Lie groups

Chinmay Maheshwari, Sukumar Srikant, Debasish Chatterjee

We study a constrained optimal control problem for an ensemble of control systems. Each sub-system (or plant) evolves on a matrix Lie group, and must satisfy given state and control action constraints pointwise in time. In addition, certain multiplexing requirement is imposed: the controller must be shared between the plants in the sense that at any time instant the control signal may be sent to only one plant. We provide first-order necessary conditions for optimality in the form of suitable Pontryagin maximum principle in this problem. Detailed numerical experiments are presented for a system of two satellites performing energy optimal maneuvers under the preceding family of constraints.

7.6OCFeb 2, 2023

Follower Agnostic Methods for Stackelberg Games

Chinmay Maheshwari, James Cheng, S. Shankar Sasty et al.

In this paper, we present an efficient algorithm to solve online Stackelberg games, featuring multiple followers, in a follower-agnostic manner. Unlike previous works, our approach works even when leader has no knowledge about the followers' utility functions or strategy space. Our algorithm introduces a unique gradient estimator, leveraging specially designed strategies to probe followers. In a departure from traditional assumptions of optimal play, we model followers' responses using a convergent adaptation rule, allowing for realistic and dynamic interactions. The leader constructs the gradient estimator solely based on observations of followers' actions. We provide both non-asymptotic convergence rates to stationary points of the leader's objective and demonstrate asymptotic convergence to a \emph{local Stackelberg equilibrium}. To validate the effectiveness of our algorithm, we use this algorithm to solve the problem of incentive design on a large-scale transportation network, showcasing its robustness even when the leader lacks access to followers' demand.

3.3MASep 6, 2024

Convergence of Decentralized Actor-Critic Algorithm in General-sum Markov Games

Chinmay Maheshwari, Manxi Wu, Shankar Sastry

Markov games provide a powerful framework for modeling strategic multi-agent interactions in dynamic environments. Traditionally, convergence properties of decentralized learning algorithms in these settings have been established only for special cases, such as Markov zero-sum and potential games, which do not fully capture real-world interactions. In this paper, we address this gap by studying the asymptotic properties of learning algorithms in general-sum Markov games. In particular, we focus on a decentralized algorithm where each agent adopts an actor-critic learning dynamic with asynchronous step sizes. This decentralized approach enables agents to operate independently, without requiring knowledge of others' strategies or payoffs. We introduce the concept of a Markov Near-Potential Function (MNPF) and demonstrate that it serves as an approximate Lyapunov function for the policy updates in the decentralized learning dynamics, which allows us to characterize the convergent set of strategies. We further strengthen our result under specific regularity conditions and with finite Nash equilibria.

7.1MAApr 6

Nash Approximation Gap in Truncated Infinite-horizon Partially Observable Markov Games

Lan Sang, Chinmay Maheshwari

Partially Observable Markov Games (POMGs) provide a general framework for modeling multi-agent sequential decision-making under asymmetric information. A common approach is to reformulate a POMG as a fully observable Markov game over belief states, where the state is the conditional distribution of the system state and agents' private information given common information, and actions correspond to mappings (prescriptions) from private information to actions. However, this reformulation is intractable in infinite-horizon settings, as both the belief state and action spaces grow with the accumulation of information over time. We propose a finite-memory truncation framework that approximates infinite-horizon POMGs by a finite-state, finite-action Markov game, where agents condition decisions only on finite windows of common and private information. Under suitable filter stability (forgetting) conditions, we show that any Nash equilibrium of the truncated game is an $\varepsilon$-Nash equilibrium of the original POMG, where $\varepsilon \to 0$ as the truncation length increases.

5.7MAApr 5

Decentralized Ergodic Coverage Control in Unknown Time-Varying Environments

Maria G. Mendoza, Victoria Marie Tuck, Chinmay Maheshwari et al.

A key challenge in disaster response is maintaining situational awareness of an evolving landscape, which requires balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs). Unmanned Aerial Vehicles (UAVs) have emerged as an effective response tool, particularly in applications like environmental monitoring and search-and-rescue, due to their ability to provide aerial coverage, withstand hazardous conditions, and navigate quickly and flexibly. However, efficient and adaptable multi-robot coverage with limited sensing in disaster settings and evolving time-varying information maps remains a significant challenge, necessitating better methods for UAVs to continuously adapt their trajectories in response to changes. In this paper, we propose a decentralized multi-agent coverage framework that serves as a high-level planning strategy for adaptive coverage in unknown, time-varying environments under partial observability. Each agent computes an adaptive ergodic policy, implemented via a Markov-chain transition model, that tracks a continuously updated belief over the underlying importance map. Gaussian Processes are used to perform those online belief updates. The resulting policy drives agents to spend time in ROIs proportional to their estimated importance, while preserving sufficient exploration to detect and adapt to time-varying environmental changes. Unlike existing approaches that assume known importance maps, require centralized coordination, or assume a static environment, our framework addresses the combined challenges of unknown, time-varying distributions in a more realistic decentralized and partially observable setting. We compare against alternative coverage strategies and analyze our method's response to simulated disaster evolution, highlighting its improved adaptability and transient performance in dynamic scenarios.

7.4LGMar 7

NePPO: Near-Potential Policy Optimization for General-Sum Multi-Agent Reinforcement Learning

Addison Kalanther, Sanika Bharvirkar, Shankar Sastry et al.

Multi-agent reinforcement learning (MARL) is increasingly used to design learning-enabled agents that interact in shared environments. However, training MARL algorithms in general-sum games remains challenging: learning dynamics can become unstable, and convergence guarantees typically hold only in restricted settings such as two-player zero-sum or fully cooperative games. Moreover, when agents have heterogeneous and potentially conflicting preferences, it is unclear what system-level objective should guide learning. In this paper, we propose a new MARL pipeline called Near-Potential Policy Optimization (NePPO) for computing approximate Nash equilibria in mixed cooperative--competitive environments. The core idea is to learn a player-independent potential function such that the Nash equilibrium of a cooperative game with this potential as the common utility approximates a Nash equilibrium of the original game. To this end, we introduce a novel MARL objective such that minimizing this objective yields the best possible potential function candidate and consequently an approximate Nash equilibrium of the original game. We develop an algorithmic pipeline that minimizes this objective using zeroth-order gradient descent and returns an approximate Nash equilibrium policy. We empirically show the superior performance of this approach compared to popular baselines such as MAPPO, IPPO and MADDPG.

7.1OCAug 17, 2025

EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization

Chinmay Maheshwari, Chinmay Pimpalkhare, Debasish Chatterjee

Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc., with gradient-based methods as a typical computational tool. Beyond convex-concave min-max optimization, the solutions found by gradient-based methods may be arbitrarily far from global optima. In this work, we present an algorithmic apparatus for computing globally optimal solutions in convex-non-concave and non-convex-concave min-max optimization. For former, we employ a reformulation that transforms it into a non-concave-convex max-min optimization problem with suitably defined feasible sets and objective function. The new form can be viewed as a generalization of Sion's minimax theorem. Next, we introduce EXOTIC-an Exact, Optimistic, Tree-based algorithm for solving the reformulated max-min problem. EXOTIC employs an iterative convex optimization solver to (approximately) solve the inner minimization and a hierarchical tree search for the outer maximization to optimistically select promising regions to search based on the approximate solution returned by convex optimization solver. We establish an upper bound on its optimality gap as a function of the number of calls to the inner solver, the solver's convergence rate, and additional problem-dependent parameters. Both our algorithmic apparatus along with its accompanying theoretical analysis can also be applied for non-convex-concave min-max optimization. In addition, we propose a class of benchmark convex-non-concave min-max problems along with their analytical global solutions, providing a testbed for evaluating algorithms for min-max optimization. Empirically, EXOTIC outperforms gradient-based methods on this benchmark as well as on existing numerical benchmark problems from the literature. Finally, we demonstrate the utility of EXOTIC by computing security strategies in multi-player games with three or more players.

8.0GTMay 21, 2023

Markov $α$-Potential Games

Xin Guo, Xinyu Li, Chinmay Maheshwari et al.

We propose a new framework of Markov $α$-potential games to study Markov games. We show that any Markov game with finite-state and finite-action is a Markov $α$-potential game, and establish the existence of an associated $α$-potential function. Any optimizer of an $α$-potential function is shown to be an $α$-stationary Nash equilibrium. We study two important classes of practically significant Markov games, Markov congestion games and the perturbed Markov team games, via the framework of Markov $α$-potential games, with explicit characterization of an upper bound for $α$ and its relation to game parameters. Additionally, we provide a semi-infinite linear programming based formulation to obtain an upper bound for $α$ for any Markov game. Furthermore, we study two equilibrium approximation algorithms, namely the projected gradient-ascent algorithm and the sequential maximum improvement algorithm, along with their Nash regret analysis, and corroborate the results with numerical experiments.

12.0OCJun 16, 2021

Zeroth-Order Methods for Convex-Concave Minmax Problems: Applications to Decision-Dependent Risk Minimization

Chinmay Maheshwari, Chih-Yuan Chiu, Eric Mazumdar et al.

Min-max optimization is emerging as a key framework for analyzing problems of robustness to strategically and adversarially generated data. We propose a random reshuffling-based gradient free Optimistic Gradient Descent-Ascent algorithm for solving convex-concave min-max problems with finite sum structure. We prove that the algorithm enjoys the same convergence rate as that of zeroth-order algorithms for convex minimization problems. We further specialize the algorithm to solve distributionally robust, decision-dependent learning problems, where gradient information is not readily available. Through illustrative simulations, we observe that our proposed approach learns models that are simultaneously robust against adversarial distribution shifts and strategic decisions from the data sources, and outperforms existing methods from the strategic classification literature.