AI Dec 12, 2025
Reliable Policy Iteration: Performance Robustness Across Architecture and Environment Perturbations
S. R. Eshwar, Aniruddha Mukherjee, Kintan Saha et al.
In a recent work, we proposed Reliable Policy Iteration (RPI), which restores policy iteration's monotonicity-of-value-estimates property to the function approximation setting. Here, we assess the robustness of RPI's empirical performance on two classical control tasks -- CartPole and Inverted Pendulum -- under changes to neural network and environmental parameters. Relative to DQN, Double DQN, DDPG, TD3, and PPO, RPI reaches near-optimal performance early and sustains this policy as training proceeds. Because deep RL methods are often hampered by sample inefficiency, training instability, and hyperparameter sensitivity, our results highlight RPI's promise as a more reliable alternative.
LG Sep 16, 2024
Reinforcement Learning with Quasi-Hyperbolic Discounting
S. R. Eshwar, Mayank Motwani, Nibedita Roy et al.
Reinforcement learning has traditionally been studied with exponential discounting or the average reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy starting from some time $t_1$ can differ from the one starting from $t_2$. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
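The time inconsistency described above can be illustrated with a small sketch. It assumes the common parameterization of QH discounting in which a reward delayed by $k \ge 1$ steps is weighted by $\sigma \beta^k$ and an immediate reward by 1; the symbols `sigma` and `beta`, the reward values, and the helper names are illustrative assumptions, not taken from the paper.

```python
def qh_weight(delay, sigma=0.5, beta=0.9):
    """Quasi-hyperbolic weight (assumed form): 1 now, sigma * beta**delay later."""
    return 1.0 if delay == 0 else sigma * beta ** delay


def exp_weight(delay, beta=0.9):
    """Standard exponential weight beta**delay, for comparison."""
    return beta ** delay


def prefers_later(weight, now, small=(5, 10.0), large=(6, 12.0)):
    """True if an agent deciding at time `now` ranks the larger-later reward
    above the smaller-sooner one. Each option is a (time, reward) pair."""
    (t_small, r_small), (t_large, r_large) = small, large
    return weight(t_large - now) * r_large > weight(t_small - now) * r_small


# Re-evaluate the same two options at every decision time 0..5.
qh_choices = [prefers_later(qh_weight, now) for now in range(6)]
exp_choices = [prefers_later(exp_weight, now) for now in range(6)]

print("QH agent prefers later:         ", qh_choices)
print("Exponential agent prefers later:", exp_choices)
```

Under these assumed numbers, the QH agent plans to wait for the larger reward at every time before 5 but switches to the smaller one once it becomes immediate, whereas the exponential agent's ranking never changes. This reversal is the kind of deviation by the "future self" that motivates an equilibrium-based policy.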
10.8 DC Apr 29
End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric
Pavan Sollu, Aniruddha Mukherjee, Divya Pulivarthi et al.
Hyperledger Fabric (HLF) is a modular, permissioned blockchain widely adopted in enterprise settings. Enhancing its throughput and latency remains challenging, as optimization decisions made in one phase of the transaction lifecycle can adversely affect other phases. In this work, we present a systematic, phase-level and end-to-end study of HLF optimizations along three fronts, combining production-grade testbed experiments with calibrated SimPy simulations. First, we introduce two novel optimization techniques that target commit-phase bottlenecks: block-level pipelining and strategic waiting. In pipelining, we overlap validation and private-data acquisition of successive blocks with state-consistency checks and ledger updates, improving commit throughput by up to 1.9x. Strategic waiting coordinates commit progress by temporarily pausing fast leaders and boosting laggers to sustain endorsement parallelism, yielding up to 1.2x higher throughput. Second, we conduct micro-benchmarking of three configuration levers: private-data dissemination, block-size selection, and endorsement peer selection. Our results reveal that: (i) Relaxed quorums for private-data dissemination significantly reduce latency in both endorsement and commit phases; (ii) Under light workloads, smaller blocks yield lower end-to-end latency, whereas, under heavy workloads, larger blocks are necessary to improve throughput and reduce latency; and (iii) Relaxed leader selection dramatically reduces dropped transactions and boosts endorsement throughput, with a modest increase in MVCC invalidations. Finally, we analyze the interplay among private-data dissemination, VSCC parallelization, and pipelined commits. Interestingly, the throughput gains over a serial commit path are maximized at a moderate level of parallelization. Together, our findings provide phase-aware and protocol-level refinements for optimizing HLF.
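To give a feel for why block-level pipelining can approach, but not exceed, a 2x gain, here is a minimal back-of-the-envelope sketch. It is not Fabric code and not the authors' SimPy model: it assumes the commit path splits into exactly two stages with fixed per-block durations (`t_validate` and `t_commit`, both hypothetical), and that the first stage never has to wait for the second.

```python
def serial_makespan(n_blocks, t_validate, t_commit):
    """Each block finishes both stages before the next block starts."""
    return n_blocks * (t_validate + t_commit)


def pipelined_makespan(n_blocks, t_validate, t_commit):
    """Two-stage pipeline: stage 1 of block i+1 overlaps stage 2 of block i."""
    validate_done = 0.0  # time at which stage 1 finishes for the current block
    commit_done = 0.0    # time at which stage 2 finishes for the current block
    for _ in range(n_blocks):
        validate_done += t_validate
        # Stage 2 starts once this block is validated AND the previous
        # block has finished its own stage 2.
        commit_done = max(commit_done, validate_done) + t_commit
    return commit_done


def speedup(n_blocks, t_validate, t_commit):
    return (serial_makespan(n_blocks, t_validate, t_commit)
            / pipelined_makespan(n_blocks, t_validate, t_commit))


balanced = speedup(100, 1.0, 1.0)  # equal stage durations
skewed = speedup(100, 1.0, 3.0)    # the second stage dominates
print(f"balanced stages: {balanced:.2f}x, skewed stages: {skewed:.2f}x")
```

In this toy model the gain is largest when the two stages take equal time and shrinks as one stage dominates, which is consistent with a measured improvement somewhat below 2x.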
LG Jun 8, 2025
Monotone and Conservative Policy Iteration Beyond the Tabular Case
S. R. Eshwar, Gugan Thoppe, Ananyabrata Barua et al.
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.
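For reference, the "textbook monotonicity" mentioned above can be observed in plain tabular policy iteration. The sketch below is that tabular baseline only, not RPI or CRPI: the random MDP, the discount factor, and all function names are illustrative assumptions, and policy evaluation is done by simple fixed-point sweeps rather than the Bellman-constrained optimization the abstract describes.

```python
import random


def evaluate(P, R, policy, gamma, sweeps=2000):
    """Iterative policy evaluation: repeat V <- R_pi + gamma * P_pi V."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s][policy[s]]
             + gamma * sum(p * v for p, v in zip(P[s][policy[s]], V))
             for s in range(n)]
    return V


def greedy(P, R, V, gamma):
    """Greedy policy with respect to the one-step lookahead on V."""
    n, m = len(R), len(R[0])
    return [max(range(m),
                key=lambda a: R[s][a]
                + gamma * sum(p * v for p, v in zip(P[s][a], V)))
            for s in range(n)]


def policy_iteration(P, R, gamma=0.9, max_iters=50):
    """Tabular PI; returns the final policy and the value estimate per iteration."""
    policy = [0] * len(R)
    history = []
    for _ in range(max_iters):
        V = evaluate(P, R, policy, gamma)
        history.append(V)
        new_policy = greedy(P, R, V, gamma)
        if new_policy == policy:
            break
        policy = new_policy
    return policy, history


# A small random MDP (hypothetical): 5 states, 3 actions.
random.seed(0)
n_states, n_actions = 5, 3
P = []
for _ in range(n_states):
    rows = []
    for _ in range(n_actions):
        w = [random.random() for _ in range(n_states)]
        total = sum(w)
        rows.append([x / total for x in w])  # normalized transition row
    P.append(rows)
R = [[random.random() for _ in range(n_actions)] for _ in range(n_states)]

policy, history = policy_iteration(P, R)
monotone = all(b >= a - 1e-9
               for prev, nxt in zip(history, history[1:])
               for a, b in zip(prev, nxt))
print("iterations:", len(history), "| values never decreased:", monotone)
```

With exact tabular evaluation, each greedy update yields a policy whose value is at least as large in every state. It is this property that can be lost once the value function is approximated, which is the gap the abstract says RPI addresses.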
LG Sep 7, 2025
Teaching Precommitted Agents: Model-Free Policy Evaluation and Control in Quasi-Hyperbolic Discounted MDPs
S. R. Eshwar
Time-inconsistent preferences, where agents favor smaller-sooner over larger-later rewards, are a key feature of human and animal decision-making. Quasi-Hyperbolic (QH) discounting provides a simple yet powerful model for this behavior, but its integration into the reinforcement learning (RL) framework has been limited. This paper addresses key theoretical and algorithmic gaps for precommitted agents with QH preferences. We make two primary contributions: (i) we formally characterize the structure of the optimal policy, proving for the first time that it reduces to a simple one-step non-stationary form; and (ii) we design the first practical, model-free algorithms for both policy evaluation and Q-learning in this setting, both with provable convergence guarantees. Our results provide foundational insights for incorporating QH preferences in RL.
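One way to see why a one-step non-stationary structure is plausible is the following short derivation. It is a sketch under assumed notation, not the paper's proof: rewards $r(s,a)$, transition kernel $P$, and QH weights of $1$ for the immediate reward and $\sigma\beta^k$ for a reward $k \ge 1$ steps ahead, with $\sigma \in (0,1]$ and $\beta \in (0,1)$. All symbols here are our own labels.

```latex
% Objective of a precommitted agent that plays action a at time 0
% and then follows a stationary policy \mu from time 1 onward:
\begin{align*}
J(s; a, \mu)
  &= \mathbb{E}\Big[\, r(s_0, a_0) + \sigma \sum_{k \ge 1} \beta^{k}\, r(s_k, a_k)
       \;\Big|\; s_0 = s,\ a_0 = a,\ a_k = \mu(s_k) \text{ for } k \ge 1 \Big] \\
  &= r(s, a) + \sigma \beta \sum_{s'} P(s' \mid s, a)\, V^{\mu}_{\beta}(s'),
\end{align*}
% where V^{\mu}_{\beta} is the usual exponentially discounted value of \mu.
% Since \sigma\beta \ge 0, the tail is maximized in every state s' by the
% \beta-optimal policy \mu^{*}_{\beta}, so only the first action differs:
\begin{align*}
a^{*}_{0}(s) \in \arg\max_{a}
  \Big\{ r(s, a) + \sigma \beta \sum_{s'} P(s' \mid s, a)\, V^{*}_{\beta}(s') \Big\},
\qquad
a_k = \mu^{*}_{\beta}(s_k) \ \text{ for } k \ge 1 .
\end{align*}
```

Under these assumptions, the agent acts once against a modified one-step objective and then follows an ordinary exponentially discounted optimal policy.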
SY Jun 20, 2024
Online Learning of Weakly Coupled MDP Policies for Load Balancing and Auto Scaling
S. R. Eshwar, Lucas Lopes Felipe, Alexandre Reiffers-Masson et al.
Load balancing and auto scaling are at the core of scalable, contemporary systems, addressing dynamic resource allocation and service rate adjustments in response to workload changes. This paper introduces a novel model and algorithms for tuning load balancers coupled with auto scalers, considering bursty traffic arriving at finite queues. We begin by presenting the problem as a weakly coupled Markov Decision Process (MDP), solvable via a linear program (LP). However, as the number of control variables of such an LP grows combinatorially, we introduce a more tractable relaxed LP formulation, and extend it to tackle the problem of online parameter learning and policy optimization using a two-timescale algorithm based on the LP Lagrangian.