Arnob Ghosh

h-index 16

14 papers

232 citations

Novelty 60%

AI Score 42

Ranked #61,135 of 194,257 authors (top 31%); #13,827 in LG (top 34%)

14 Papers

20.5 | LG | Jun 23, 2022

Provably Efficient Model-Free Constrained RL with Linear Function Approximation

Arnob Ghosh, Xingyu Zhou, Ness Shroff

We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a "simulator", we aim to develop the first model-free, simulator-free algorithm that achieves sublinear regret and sublinear constraint violation even in large-scale systems. To this end, we consider episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as linear functions of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithm. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance regret against constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve even zero constraint violation while still maintaining the same order with respect to $T$.
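The two ingredients named in this abstract, a soft-max policy in place of greedy selection and a primal-dual update, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the composite score `Q_r + dual * Q_g`, the temperature parameter, and the projected dual step are all assumptions chosen for illustration, and the Q-value estimates are taken as given rather than computed by LSVI-UCB.

```python
import math

def softmax_policy(q_reward, q_utility, dual, temperature):
    """Action probabilities proportional to exp(temperature * (Q_r + dual * Q_g)).

    q_reward / q_utility are hypothetical per-action Q-value estimates for one state.
    """
    scores = [temperature * (qr + dual * qg) for qr, qg in zip(q_reward, q_utility)]
    peak = max(scores)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def dual_update(dual, utility_estimate, threshold, step_size, dual_cap):
    """Projected dual step (assumed form): the multiplier grows when the estimated
    utility falls short of the threshold and is clipped to [0, dual_cap]."""
    stepped = dual + step_size * (threshold - utility_estimate)
    return min(max(stepped, 0.0), dual_cap)
```

As the temperature grows, the soft-max distribution concentrates on the action with the highest composite score, so it approaches greedy selection while remaining a smooth function of the Q-values.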

18.8 | LG | Mar 10, 2023

Provably Efficient Model-Free Algorithms for Non-stationary CMDPs

Honghao Wei, Arnob Ghosh, Ness Shroff et al.

We study model-free reinforcement learning (RL) algorithms in episodic non-stationary constrained Markov Decision Processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a cumulative constraint on the expected utility (cost). In the non-stationary environment, reward, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs in both tabular and linear function approximation settings with provable performance guarantees. Our results on regret bound and constraint violation for the tabular case match the corresponding best results for stationary CMDPs when the total budget is known. Additionally, we present a general framework for addressing the well-known challenges associated with analyzing non-stationary CMDPs, without requiring prior knowledge of the variation budget. We apply the approach for both tabular and linear approximation settings.

2.9 | OC | Dec 1, 2016

Menu-Based Pricing for Charging of Electric Vehicles with Vehicle-to-Grid Service

Arnob Ghosh, Vaneet Aggarwal

The paper considers a bidirectional power flow model of the electric vehicles (EVs) in a charging station. The EVs can inject energy by discharging via a Vehicle-to-Grid (V2G) service, which can enhance the profits of the charging station. However, frequent charging and discharging degrade battery life, so proper compensation needs to be paid to users who participate in the V2G service. We propose a menu-based pricing scheme, where the charging station selects a price for each arriving user for the amount of battery utilization, the total energy, and the time (deadline) that the EV will stay. The user can accept one of the contracts or reject all of them, depending on their utilities. The charging station can serve users using a combination of renewable energy and conventional energy bought from the grid. We show that although there exists a profit-maximizing price which maximizes the social welfare, it provides no surplus to the users if the charging station is aware of the utilities of the users. If the charging station is not aware of the exact utilities, the social-welfare-maximizing price may not maximize the expected profit. In fact, it can give zero profit. We propose a pricing strategy which provides a guaranteed fixed profit to the charging station and also maximizes the expected profit for a wide range of utility functions. Our analysis shows that when the harvested renewable energy is small, the users have higher incentives for the V2G service. We show numerically that the charging station's profit and the user's surplus both increase as the V2G service is efficiently utilized by the pricing mechanism.

9.8 | LG | Jun 1, 2023

Achieving Fairness in Multi-Agent Markov Decision Processes Using Reinforcement Learning

Peizhong Ju, Arnob Ghosh, Ness B. Shroff

Fairness plays a crucial role in various multi-agent systems (e.g., communication networks, financial markets, etc.). Many multi-agent dynamical interactions can be cast as Markov Decision Processes (MDPs). While existing research has focused on studying fairness in known environments, the exploration of fairness in such systems for unknown environments remains open. In this paper, we propose a Reinforcement Learning (RL) approach to achieve fairness in multi-agent finite-horizon episodic MDPs. Instead of maximizing the sum of individual agents' value functions, we introduce a fairness function that ensures equitable rewards across agents. Since the classical Bellman's equation does not hold when the sum of individual value functions is not maximized, we cannot use traditional approaches. Instead, in order to explore, we maintain a confidence bound of the unknown environment and then propose an online convex optimization based approach to obtain a policy constrained to this confidence region. We show that such an approach achieves sub-linear regret in terms of the number of episodes. Additionally, we provide a probably approximately correct (PAC) guarantee based on the obtained regret bound. We also propose an offline RL algorithm and bound the optimality gap with respect to the optimal fair solution. To mitigate computational complexity, we introduce a policy-gradient type method for the fair objective. Simulation experiments also demonstrate the efficacy of our approach.

3.3 | LG | Nov 28, 2022

Provably Efficient Model-free RL in Leader-Follower MDP with Linear Function Approximation

Arnob Ghosh

We consider a multi-agent episodic MDP setup where an agent (leader) takes an action at each step of the episode, followed by another agent (follower). The state evolution and rewards depend on the joint action pair of the leader and the follower. Such interactions find applications in many domains such as smart grids, mechanism design, security, and policymaking. We are interested in how to learn policies for both players with provable performance guarantees under a bandit feedback setting. We focus on a setup where both the leader and the follower are non-myopic, i.e., they both seek to maximize their rewards over the entire episode, and consider a linear MDP, which can model the continuous state spaces that are common in many RL applications. We propose a model-free RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps under the bandit feedback information setup. Thus, our result holds even when the number of states becomes infinite. The algorithm relies on a novel adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard greedy policy (as the best response) with the soft-max policy for both the leader and the follower. This turns out to be key in establishing a uniform concentration bound for the value functions. To the best of our knowledge, this is the first sub-linear regret guarantee for Markov games with non-myopic followers with function approximation.

8.6 | AR | Oct 27, 2024 | Code

SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance

Deepak Vungarala, Sakila Alam, Arnob Ghosh et al.

Large Language Models (LLMs) have shown great potential in automating code generation; however, their ability to generate accurate circuit-level SPICE code remains limited due to a lack of hardware-specific knowledge. In this paper, we analyze and identify the typical limitations of existing LLMs in SPICE code generation. To address these limitations, we present SPICEPilot, a novel Python-based dataset generated using PySpice, along with its accompanying framework. This marks a significant step forward in automating SPICE code generation across various circuit configurations. Our framework automates the creation of SPICE simulation scripts, introduces standardized benchmarking metrics to evaluate LLMs' ability to generate circuits, and outlines a roadmap for integrating LLMs into the hardware design process. SPICEPilot is open-sourced under the permissive MIT license at https://github.com/ACADLab/SPICEPilot.git.

13.0 | LG | May 25, 2025

Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

Sourav Ganguly, Arnob Ghosh, Kishan Panaganti et al.

Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that maximizes the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP), where an agent needs to maximize the reward and satisfy the constraint against the worst possible stochastic model in the uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for standard constrained MDPs (CMDPs), are not applicable here because the strong duality property does not hold. Further, one cannot apply the standard robust value-iteration based approach to the composite value function either, as the worst-case models may be different for the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function in order to satisfy the constraints; when all the constraints are satisfied, it simply maximizes the robust reward value function. We prove that such an algorithm finds a feasible policy with at most $ε$ sub-optimality after $O(ε^{-2})$ iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search; thus, we reduce the computation time by at least 4x for smaller values of the discount factor ($γ$) and by at least 6x for larger values of $γ$.
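The switching idea described in this abstract (work on the constraint value while a constraint is violated, otherwise maximize the robust reward) can be sketched as a small selection rule. This is an illustrative sketch only, not the paper's algorithm: the function name, the tolerance parameter, the cost-style convention that a constraint is satisfied when its value is at most its budget, and the choice of the most violated constraint are all assumptions. It also assumes at least one constraint is given.

```python
def select_objective(constraint_values, budgets, tolerance=0.0):
    """Pick which objective a policy update should follow (hypothetical helper).

    Returns ("constraint", i) for the most violated constraint i, or
    ("reward", None) when every constraint value is within its budget.
    """
    # Positive entries mean the robust constraint value exceeds its budget.
    violations = [value - budget for value, budget in zip(constraint_values, budgets)]
    worst = max(range(len(violations)), key=violations.__getitem__)
    if violations[worst] > tolerance:
        return ("constraint", worst)  # next step would decrease this constraint value
    return ("reward", None)           # all constraints satisfied: improve the reward
```

For instance, `select_objective([0.3, 0.9], [0.5, 0.5])` selects the second constraint, while `select_objective([0.3, 0.4], [0.5, 0.5])` selects the reward objective.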

16.9 | LG | Feb 25, 2025

Provably Efficient RL for Linear MDPs under Instantaneous Safety Constraints in Non-Convex Feature Spaces

Amirhossein Roknilamouki, Arnob Ghosh, Ming Shi et al.

In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}τ\bigr) \sqrt{\log(\tfrac{1}τ) d^3 H^4 K} \bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $τ$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called Objective Constraint-Decomposition (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.

11.5 | LG | Jan 1, 2024

Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning

Honghao Wei, Xiyue Peng, Arnob Ghosh et al.

We propose WSAC (Weighted Safe Actor-Critic), a novel algorithm for Safe Offline Reinforcement Learning (RL) under functional approximation, which can robustly optimize policies to improve upon an arbitrary reference policy with limited data coverage. WSAC is designed as a two-player Stackelberg game to optimize a refined objective function. The actor optimizes the policy against two adversarially trained value critics with small importance-weighted Bellman errors, which focus on scenarios where the actor's performance is inferior to the reference policy. In theory, we demonstrate that when the actor employs a no-regret optimization oracle, WSAC achieves a number of guarantees: (i) For the first time in the safe offline RL setting, we establish that WSAC can produce a policy that outperforms any reference policy while maintaining the same level of safety, which is critical to designing a safe algorithm for offline RL. (ii) WSAC achieves the optimal statistical convergence rate of $1/\sqrt{N}$ to the reference policy, where $N$ is the size of the offline dataset. (iii) We theoretically show that WSAC guarantees a safe policy improvement across a broad range of hyperparameters that control the degree of pessimism, indicating its practical robustness. Additionally, we offer a practical version of WSAC and compare it with existing state-of-the-art safe offline RL algorithms in several continuous control environments. WSAC outperforms all baselines across a range of tasks, supporting the theoretical results.

9.4 | LG | Oct 3, 2025

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee et al.

Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and provides no provable safety guarantee for a fixed dual variable, which can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM responses, being at least 5 times more efficient against nominal and jailbreaking prompts.
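A rectified penalty of the kind this abstract refers to can be written in a few lines. The sketch below shows the generic exact-penalty form, reward minus a fixed coefficient times the positive part of the constraint violation; it is not taken from the CS-RLHF code, and the function name, argument names, and the scalar reward/cost inputs are assumptions for illustration.

```python
def rectified_penalty_objective(reward, cost, threshold, penalty):
    """Scalar objective with a fixed, rectified (ReLU-style) penalty.

    reward, cost: hypothetical scores for one response.
    threshold:    largest cost still considered safe.
    penalty:      fixed penalty coefficient (no dual-variable updates).
    """
    violation = max(0.0, cost - threshold)  # zero whenever the response is within the threshold
    return reward - penalty * violation
```

Unlike a Lagrangian term, the rectified penalty leaves the objective untouched for responses that already satisfy the constraint, so only violations are penalized.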

5.9 | MA | Dec 31, 2020

Model Free Reinforcement Learning Algorithm for Stationary Mean field Equilibrium for Multiple Types of Agents

Arnob Ghosh, Vaneet Aggarwal

We consider a multi-agent Markov strategic interaction over an infinite horizon where agents can be of multiple types. We model the strategic interaction as a mean-field game in the asymptotic limit when the number of agents of each type becomes infinite. Each agent has a private state; the state evolves depending on the distribution of the state of the agents of different types and the action of the agent. Each agent wants to maximize the discounted sum of rewards over the infinite horizon, which depends on the state of the agent and the distribution of the state of the leaders and followers. We seek to characterize and compute a stationary multi-type mean-field equilibrium (MMFE) in the above game. We characterize the conditions under which a stationary MMFE exists. Finally, we propose a reinforcement learning (RL) based algorithm using a policy gradient approach to find the stationary MMFE when the agents are unaware of the dynamics. We numerically evaluate how this kind of interaction can model cyber attacks among defenders and adversaries, and show how the RL-based algorithm can converge to an equilibrium.

8.1 | LG | May 30, 2019

Reinforcement Learning for Mean Field Game

Mridul Agarwal, Vaneet Aggarwal, Arnob Ghosh et al.

Stochastic games provide a framework for interactions among multiple agents and enable a myriad of applications. In these games, agents decide on actions simultaneously, the state of every agent moves to the next state, and each agent receives a reward. However, finding an equilibrium (if one exists) in this game is often difficult when the number of agents becomes large. This paper focuses on finding a mean-field equilibrium (MFE) in an action-coupled stochastic game setting in an episodic framework. It is assumed that the impact of the other agents can be captured by the empirical distribution of the mean of the actions. All agents know the action distribution and employ lower-myopic best response dynamics to choose the optimal oblivious strategy. This paper proposes a posterior sampling based approach for reinforcement learning in the mean-field game, where each agent samples a transition probability from the previous transitions. We show that the policy and action distributions converge to the optimal oblivious strategy and the limiting distribution, respectively, which constitute an MFE.

5.1 | MM | Nov 30, 2018

A Robust Algorithm for Tile-based 360-degree Video Streaming with Uncertain FoV Estimation

Arnob Ghosh, Vaneet Aggarwal, Feng Qian

We propose a robust scheme for streaming 360-degree immersive videos to maximize the quality of experience (QoE). Our streaming approach introduces a holistic analytical framework built upon the formal method of stochastic optimization. We propose a robust algorithm which provides a streaming rate such that the video quality degrades below that rate with very low probability, even in the presence of uncertain head movement and bandwidth. It assumes knowledge of the viewing probability of different portions (tiles) of a panoramic scene. Such probabilities can be easily derived from crowdsourced measurements performed by 360-degree video content providers. We then propose efficient methods to solve the problem at runtime while achieving a bounded optimality gap (in terms of the QoE). We implemented our proposed approaches using emulation. Using real users' head movement traces and real cellular bandwidth traces, we show that our algorithms outperform the baseline algorithms by at least $30\%$ in the QoE metric. Our algorithm gives a streaming rate which is $50\%$ higher than that of the baseline algorithms when the prediction error is high.

5.9 | MM | Apr 26, 2017

A Rate Adaptation Algorithm for Tile-based 360-degree Video Streaming

Arnob Ghosh, Vaneet Aggarwal, Feng Qian

In 360-degree immersive video, a user only views a part of the entire raw video frame based on her viewing direction. However, today's 360-degree video players always fetch the entire panoramic view regardless of users' head movement, leading to significant bandwidth waste that can potentially be avoided. In this paper, we propose a novel adaptive streaming scheme for 360-degree videos. The basic idea is to fetch the invisible portion of a video at the lowest quality based on users' head movement prediction and to adaptively decide the video playback quality for the visible portion based on bandwidth prediction. Doing both in a robust manner requires overcoming a series of challenges, such as jointly considering the spatial and temporal domains, tolerating prediction errors, and achieving low complexity. To overcome these challenges, we first define quality of experience (QoE) metrics for adaptive 360-degree video streaming. We then formulate an optimization problem and solve it at low complexity. The algorithm strategically leverages both future bandwidth and the distribution of users' head positions to determine the quality level of each tile (i.e., a sub-area of a raw frame). We further provide a theoretical proof showing that our algorithm achieves optimality under practical assumptions. Numerical results show that our proposed algorithms boost the user QoE by at least 20\% compared to baseline algorithms.
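The basic idea stated in this abstract, lowest quality for tiles predicted to be invisible and a bandwidth-driven quality for visible tiles, can be illustrated with a deliberately simplified Python sketch. It is not the paper's optimization: it assigns one uniform level to all visible tiles, ignores the head-position distribution and the temporal dimension, and assumes tile sizes grow with the quality level. The function name and data layout are hypothetical.

```python
def assign_tile_qualities(tile_sizes, visible, bandwidth_budget):
    """Choose a quality level per tile for one video chunk (simplified sketch).

    tile_sizes[t][q]: size of tile t at quality level q, where level 0 is the lowest.
    visible:          set of tile indices predicted to be in the field of view.
    bandwidth_budget: predicted amount of data that can be fetched for this chunk.
    """
    levels = len(tile_sizes[0])
    # Tiles outside the predicted field of view are always fetched at level 0.
    base = sum(sizes[0] for t, sizes in enumerate(tile_sizes) if t not in visible)
    chosen = 0  # fall back to the lowest level if nothing higher fits
    for q in range(levels):
        total = base + sum(tile_sizes[t][q] for t in visible)
        if total <= bandwidth_budget:
            chosen = q  # highest uniform level for visible tiles that fits the budget
    return [chosen if t in visible else 0 for t in range(len(tile_sizes))]
```

With four tiles of sizes `[1, 2, 4]` per level, visible tiles `{0, 1}`, and a budget of 7, the sketch returns `[1, 1, 0, 0]`: level 2 for the visible tiles would need 10 units in total, which exceeds the budget.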