LG · Feb 2, 2023
Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms
Yashaswini Murthy, Mehrdad Moharrami, R. Srikant
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., schemes in which both policy improvement and policy evaluation are performed approximately. In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor close to $1$, which is equivalent to making the expected horizon very large. However, the corresponding theoretical error bounds scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Obtaining meaningful performance bounds for approximate PI and RL algorithms in the average-reward setting has therefore been an open problem. In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the asymptotic error goes to zero as the policy evaluation and policy improvement errors go to zero.
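To make the scaling issue concrete, recall a classical discounted-setting bound for approximate PI from the dynamic programming literature. The notation is assumed here and does not come from the abstract: $\epsilon$ is the policy evaluation error, $\delta$ the policy improvement error, $\gamma$ the discount factor, and $1/(1-\gamma)$ the effective horizon. Normalizing the bound by the horizon still leaves a term that blows up as $\gamma \to 1$ for any fixed positive errors:

```latex
\limsup_{k\to\infty} \left\| V^{\pi_k} - V^{*} \right\|_\infty
  \le \frac{\delta + 2\gamma\epsilon}{(1-\gamma)^2},
\qquad
(1-\gamma)\cdot\frac{\delta + 2\gamma\epsilon}{(1-\gamma)^2}
  = \frac{\delta + 2\gamma\epsilon}{1-\gamma}
  \xrightarrow[\gamma\to 1]{} \infty .
```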
LG · Feb 8, 2023
On the Convergence of Modified Policy Iteration in Risk Sensitive Exponential Cost Markov Decision Processes
Yashaswini Murthy, Mehrdad Moharrami, R. Srikant
Modified policy iteration (MPI) is a dynamic programming algorithm that combines elements of policy iteration and value iteration. The convergence of MPI has been well studied in the context of discounted and average-cost MDPs. In this work, we consider the exponential cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied in the context of risk-sensitive MDPs, MPI is unexplored. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Since the exponential cost formulation deals with the multiplicative Bellman equation, our main contribution is a convergence proof that is quite different from existing results for discounted and risk-neutral average-cost problems, as well as from risk-sensitive value and policy iteration approaches. We conclude our analysis with simulation results, assessing MPI's performance relative to alternative dynamic programming methods such as value iteration and policy iteration across diverse problem parameters. Our findings highlight risk-sensitive MPI's enhanced computational efficiency compared to both value and policy iteration.
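For readers unfamiliar with MPI, the following is a minimal sketch of the standard risk-neutral, discounted-reward version, not the risk-sensitive algorithm analyzed in the paper. The function name, the array layout (`P[a][s][s2]`, `R[s][a]`), and the toy two-state MDP are all assumptions made for illustration. Each iteration performs one greedy improvement step followed by `m` applications of the policy's Bellman operator, so a small `m` behaves like value iteration and a large `m` approaches policy iteration.

```python
def modified_policy_iteration(P, R, gamma=0.9, m=5, iters=200):
    """Standard discounted MPI (illustrative sketch, risk-neutral).

    P[a][s][s2]: probability of moving from s to s2 under action a.
    R[s][a]:     immediate reward for taking action a in state s.
    m:           number of partial-evaluation sweeps per iteration.
    """
    n_states = len(R)
    n_actions = len(R[0])
    V = [0.0] * n_states
    policy = [0] * n_states

    def q_value(s, a, values):
        # One-step lookahead value of action a in state s.
        expected_next = sum(P[a][s][s2] * values[s2] for s2 in range(n_states))
        return R[s][a] + gamma * expected_next

    for _ in range(iters):
        # Policy improvement: act greedily with respect to the current V.
        policy = [
            max(range(n_actions), key=lambda a: q_value(s, a, V))
            for s in range(n_states)
        ]
        # Partial policy evaluation: apply the policy's Bellman operator m times.
        for _ in range(m):
            V = [q_value(s, policy[s], V) for s in range(n_states)]
    return V, policy


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
P_toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
]
R_toy = [
    [0.0, 1.0],  # state 0: staying pays 0, switching pays 1
    [2.0, 0.0],  # state 1: staying pays 2, switching pays 0
]
V_toy, policy_toy = modified_policy_iteration(P_toy, R_toy)
```

In this toy problem the best behavior is to move to state 1 and stay there, which the sketch recovers as the policy `[1, 0]`.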
LG · Mar 31
Offline Constrained RLHF with Multiple Preference Oracles
Brenden Latham, Mehrdad Moharrami
We study offline constrained reinforcement learning from human feedback with multiple preference oracles. Motivated by applications that trade off performance with safety or fairness, we aim to maximize target population utility subject to a minimum protected group welfare constraint. From pairwise comparisons collected under a reference policy, we estimate oracle-specific rewards via maximum likelihood and analyze how statistical uncertainty propagates through the dual program. We cast the constrained objective as a KL-regularized Lagrangian whose primal optimizer is a Gibbs policy, reducing learning to a convex dual problem. We propose a dual-only algorithm that ensures high-probability constraint satisfaction and provide the first finite-sample performance guarantees for offline constrained preference learning. Finally, we extend our theoretical analysis to accommodate multiple constraints and general f-divergence regularization.
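The reduction to a convex dual can be sketched for a single constraint as follows. All notation is assumed for illustration and may differ from the paper's: $r_0$ and $r_1$ are the target and protected-group rewards, $b$ the welfare threshold, $\beta > 0$ the KL weight, $\pi_{\mathrm{ref}}$ the reference policy, and $\lambda \ge 0$ the multiplier. For fixed $\lambda$, the KL-regularized Lagrangian is maximized by a Gibbs policy, and substituting it back gives a dual function that is convex in $\lambda$ (a log-partition function plus a linear term):

```latex
\mathcal{L}(\pi,\lambda)
  = \mathbb{E}_{x,\,a\sim\pi}\!\left[ r_0(x,a) + \lambda\, r_1(x,a) \right]
    - \lambda b
    - \beta\, \mathbb{E}_{x}\!\left[ \mathrm{KL}\!\left( \pi(\cdot\mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \right) \right],
\\[4pt]
\pi_\lambda(a\mid x) \;\propto\; \pi_{\mathrm{ref}}(a\mid x)\,
  \exp\!\left( \frac{r_0(x,a) + \lambda\, r_1(x,a)}{\beta} \right),
\\[4pt]
g(\lambda) = \max_{\pi} \mathcal{L}(\pi,\lambda)
  = \beta\, \mathbb{E}_{x}\!\left[ \log \sum_{a} \pi_{\mathrm{ref}}(a\mid x)\,
      \exp\!\left( \frac{r_0(x,a) + \lambda\, r_1(x,a)}{\beta} \right) \right] - \lambda b,
\qquad
\lambda^{*} \in \arg\min_{\lambda \ge 0} g(\lambda).
```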
LG · Nov 23, 2025
Tail Distribution of Regret in Optimistic Reinforcement Learning
Sajad Khodadadian, Mehrdad Moharrami
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $\alpha$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.
SY · Feb 4, 2022
Learning to Admit Optimally in an $M/M/k/k+N$ Queueing System with Unknown Service Rate
Saghar Adler, Mehrdad Moharrami, Vijay Subramanian
Motivated by applications of the Erlang-B blocking model and the extended $M/M/k/k+N$ model that allows for some queueing, beyond communication networks to sizing and pricing in production, messaging, and app-based parking systems, we study admission control for such systems with unknown service rate. In our model, a dispatcher either admits every arrival into the system (when there is room) or blocks it. Every served job yields a fixed reward but incurs a per-unit-time holding cost, which also accrues during any time the job waits in the queue for service. We aim to design a dispatching policy that maximizes the long-term average reward by observing arrival times and the system state at arrivals, a realistic decision-event-driven sampling of such systems. The dispatcher observes neither service times nor departure epochs, which excludes the use of reward-based reinforcement learning approaches. We develop our learning-based dispatch scheme as a parametric learning problem à la self-tuning adaptive control. In our problem, certainty equivalent control switches between always admit if room (explore infinitely often) and never admit (terminate learning), so at judiciously chosen times we avoid the never admit recommendation. We prove that our proposed policy asymptotically converges to the optimal policy and present finite-time regret guarantees. The extreme contrast in the control policies shows up in our regret bounds for different parameter regimes: constant in one versus logarithmic in another.
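As background for the Erlang-B model mentioned above, here is a minimal sketch of the classical blocking-probability recursion for an $M/M/k/k$ system. It assumes the offered load (arrival rate divided by service rate) is known, which is exactly what the paper does not assume, so it illustrates the baseline model only and not the learning scheme. The function name `erlang_b` is hypothetical.

```python
def erlang_b(k: int, offered_load: float) -> float:
    """Blocking probability of an M/M/k/k system (Erlang-B).

    offered_load is the arrival rate divided by the per-server service
    rate; it is assumed known here, unlike in the learning problem.
    Uses the numerically stable recursion
        B(0) = 1,  B(j) = a * B(j-1) / (j + a * B(j-1)).
    """
    if k < 0 or offered_load < 0:
        raise ValueError("k and offered_load must be non-negative")
    blocking = 1.0
    for j in range(1, k + 1):
        blocking = offered_load * blocking / (j + offered_load * blocking)
    return blocking
```

The recursion avoids the factorials and large powers in the closed-form expression, which matters once the number of servers grows.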