P. R. Kumar

h-index25

38papers

376citations

Novelty54%

AI Score51

Ranked #41,347 of 201,326 authors (top 21%)#9,454 in LG (top 22%)

38 Papers

AINov 2, 2022

Energy System Digitization in the Era of AI: A Three-Layered Approach towards Carbon Neutrality

Le Xie, Tong Huang, Xiangtian Zheng et al.

The transition towards carbon-neutral electricity is one of the biggest game changers in addressing climate change since it addresses the dual challenges of removing carbon emissions from the two largest sectors of emitters: electricity and transportation. The transition to a carbon-neutral electric grid poses significant challenges to conventional paradigms of modern grid planning and operation. Much of the challenge arises from the scale of the decision making and the uncertainty associated with the energy supply and demand. Artificial Intelligence (AI) could potentially have a transformative impact on accelerating the speed and scale of carbon-neutral transition, as many decision making processes in the power grid can be cast as classic, though challenging, machine learning tasks. We point out that to amplify AI's impact on carbon-neutral transition of the electric energy systems, the AI algorithms originally developed for other applications should be tailored in three layers of technology, markets, and policy.

SYNov 29, 2016

Architecture and Algorithms for Privacy Preserving Thermal Inertial Load Management by A Load Serving Entity

Abhishek Halder, Xinbo Geng, P. R. Kumar et al.

Motivated by the growing importance of demand response in modern power system's operations, we propose an architecture and supporting algorithms for privacy preserving thermal inertial load management as a service provided by the load serving entity (LSE). We focus on an LSE managing a population of its customers' air conditioners, and propose a contractual model where the LSE guarantees quality of service to each customer in terms of keeping their indoor temperature trajectories within respective bands around the desired individual comfort temperatures. We show how the LSE can price the contracts differentiated by the flexibility embodied by the width of the specified bands. We address architectural questions of (i) how the LSE can strategize its energy procurement based on price and ambient temperature forecasts, (ii) how an LSE can close the real time control loop at the aggregate level while providing individual comfort guarantees to loads, without ever measuring the states of an air conditioner for privacy reasons. Control algorithms to enable our proposed architecture are given, and their efficacy is demonstrated on real data.

31.0LGMay 21Code

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

Romil V. Sonigra, P. R. Kumar

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld-mbrl .

SYFeb 27, 2015

Optimal Energy-Efficient Regular Delivery of Packets in Cyber-Physical Systems

Xueying Guo, Rahul Singh, P. R. Kumar et al.

In cyber-physical systems such as in-vehicle wireless sensor networks, a large number of sensor nodes continually generate measurements that should be received by other nodes such as actuators in a regular fashion. Meanwhile, energy-efficiency is also important in wireless sensor networks. Motivated by these, we develop scheduling policies which are energy efficient and simultaneously maintain "regular" deliveries of packets. A tradeoff parameter is introduced to balance these two conflicting objectives. We employ a Markov Decision Process (MDP) model where the state of each client is the time-since-last-delivery of its packet, and reduce it into an equivalent finite-state MDP problem. Although this equivalent problem can be solved by standard dynamic programming techniques, it suffers from a high-computational complexity. Thus we further pose the problem as a restless multi-armed bandit problem and employ the low-complexity Whittle Index policy. It is shown that this problem is indexable and the Whittle indexes are derived. Also, we prove the Whittle Index policy is asymptotically optimal and validate its optimality via extensive simulations.

LGJul 17, 2023

Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation

Ruida Zhou, Tao Liu, Min Cheng et al.

We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but are no longer tractable when the number of states scales up. To this end, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.

SYJun 27, 2016

Dynamic Watermarking: Active Defense of Networked Cyber-Physical Systems

Bharadwaj Satchidanandan, P. R. Kumar

The coming decades may see the large scale deployment of networked cyber-physical systems to address global needs in areas such as energy, water, healthcare, and transportation. However, as recent events have shown, such systems are vulnerable to cyber attacks. Being safety critical, their disruption or misbehavior can cause economic losses or injuries and loss of life. It is therefore important to secure such networked cyber-physical systems against attacks. In the absence of credible security guarantees, there will be resistance to the proliferation of cyber-physical systems, which are much needed to meet global needs in critical infrastructures and services. This paper addresses the problem of secure control of networked cyber-physical systems. This problem is different from the problem of securing the communication network, since cyber-physical systems at their very essence need sensors and actuators that interface with the physical plant, and malicious agents may tamper with sensors or actuators, as recent attacks have shown. We consider physical plants that are being controlled by multiple actuators and sensors communicating over a network, where some sensors could be "malicious," meaning that they may not report the measurements that they observe. We address a general technique by which the actuators can detect the actions of malicious sensors in the system, and disable closed-loop control based on their information. This technique, called "watermarking," employs the technique of actuators injecting private excitation into the system which will reveal malicious tampering with signals. We show how such an active defense can be used to secure networked systems of sensors and actuators.

NIJun 6, 2016

Throughput Optimal Decentralized Scheduling of Multi-Hop Networks with End-to-End Deadline Constraints: Unreliable Links

Rahul Singh, P. R. Kumar

We consider unreliable multi-hop networks serving multiple flows in which packets not delivered to their destination nodes by their deadlines are dropped. We address the design of policies for routing and scheduling packets that optimize any specified weighted average of the throughputs of the flows. We provide a new approach which directly yields an optimal distributed scheduling policy that attains any desired maximal timely-throughput vector under average-power constraints on the nodes. It pursues a novel intrinsically stochastic decomposition of the Lagrangian of the constrained network-wide MDP rather than of the fluid model. All decisions regarding a packet's transmission scheduling, transmit power level, and routing, are completely distributed, based solely on the age of the packet, not requiring any knowledge of network state or queue lengths at any of the nodes. Global coordination is achieved through a tractably computable "price" for transmission energy. This price is different from that used to derive the backpressure policy where price corresponds to queue lengths. A quantifiably near-optimal policy is provided if nodes have peak-power constraints.

LGJun 10, 2022

Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

Ruida Zhou, Tao Liu, Dileep Kalathil et al.

We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the designed algorithms based on the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact gradients and sample-based scenarios.

SYApr 21, 2018

An Online Detection Framework for Cyber Attacks on Automatic Generation Control

Tong Huang, Bharadwaj Satchidanandan, P. R. Kumar et al.

We propose an online framework to detect cyber attacks on Automatic Generation Control (AGC). A cyber at- tack detection algorithm is designed based on the approach of Dynamic Watermarking. The detection algorithm provides a theoretical guarantee of detection of cyber attacks launched by sophisticated attackers possessing extensive knowledge of the physical and statistical models of targeted power systems. The proposed framework is practically implementable, as it needs no hardware update on generation units. The efficacy of the proposed framework is validated in both four-area system and 140-bus system.

LGOct 15, 2023

Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games

Youbang Sun, Tao Liu, Ruida Zhou et al.

This work studies an independent natural policy gradient (NPG) algorithm for the multi-agent reinforcement learning problem in Markov potential games. It is shown that, under mild technical assumptions and the introduction of the \textit{suboptimality gap}, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an $ε$-Nash Equilibrium (NE) within $\mathcal{O}(1/ε)$ iterations. This improves upon the previous best result of $\mathcal{O}(1/ε^2)$ iterations and is of the same order, $\mathcal{O}(1/ε)$, that is achievable for the single-agent case. Empirical results for a synthetic potential game and a congestion game are presented to verify the theoretical bounds.

SYMay 15, 2012

Bounded epsilon-Reach Set Computation of a Class of Deterministic and Transversal Linear Hybrid Automata

Kyoung-Dae Kim, Sayan Mitra, P. R. Kumar

We define a special class of hybrid automata, called Deterministic and Transversal Linear Hybrid Automata (DTLHA), whose continuous dynamics in each location are linear time-invariant (LTI) with a constant input, and for which every discrete transition up to a given bounded time is deterministic and, importantly, transversal. For such a DTLHA starting from an initial state, we show that it is possible to compute an approximation of the reach set of a DTLHA over a finite time interval that is arbitrarily close to the exact reach set, called a bounded epsilon-reach set, through sampling and polyhedral over-approximation of sampled states. We propose an algorithm and an attendant architecture for the overall bounded epsilon-reach set computation process.

LGFeb 10

Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Akshay Mete, Shahid Aamir Sheikh, Tzu-Hsiang Lin et al.

Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.

LGJan 29, 2023

Bounded (O(1)) Regret Recommendation Learning via Synthetic Controls Oracle

Enoch Hyunwook Kang, P. R. Kumar

In online exploration systems where users with fixed preferences repeatedly arrive, it has recently been shown that O(1), i.e., bounded regret, can be achieved when the system is modeled as a linear contextual bandit. This result may be of interest for recommender systems, where the popularity of their items is often short-lived, as the exploration itself may be completed quickly before potential long-run non-stationarities come into play. However, in practice, exact knowledge of the linear model is difficult to justify. Furthermore, potential existence of unobservable covariates, uneven user arrival rates, interpretation of the necessary rank condition, and users opting out of private data tracking all need to be addressed for practical recommender system applications. In this work, we conduct a theoretical study to address all these issues while still achieving bounded regret. Aside from proof techniques, the key differentiating assumption we make here is the presence of effective Synthetic Control Methods (SCM), which are shown to be a practical relaxation of the exact linear model knowledge assumption. We verify our theoretical bounded regret result using a minimal simulation experiment.

LGOct 17, 2023

Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs

Yu-Heng Hung, Ping-Chun Hsieh, Akshay Mete et al.

We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition probabilities of the dynamic model can be linearly parameterized with the help of a predefined low-dimensional feature mapping. While the existing regression-based approaches have been theoretically shown to achieve nearly-optimal regret, they are computationally rather inefficient due to the need for a large number of optimization runs in each time step, especially when the state and action spaces are large. To address this issue, we propose to solve linear MDPs through the lens of Value-Biased Maximum Likelihood Estimation (VBMLE), which is a classic model-based exploration principle in the adaptive control literature for resolving the well-known closed-loop identification problem of Maximum Likelihood Estimation. We formally show that (i) VBMLE enjoys $\widetilde{O}(d\sqrt{T})$ regret, where $T$ is the time horizon and $d$ is the dimension of the model parameter, and (ii) VBMLE is computationally more efficient as it only requires solving one optimization problem in each time step. In our regret analysis, we offer a generic convergence result of MLE in linear MDPs through a novel supermartingale construct and uncover an interesting connection between linear MDPs and online learning, which could be of independent interest. Finally, the simulation results show that VBMLE significantly outperforms the benchmark method in terms of both empirical regret and computation time.

LGMar 9, 2024

Provable Policy Gradient Methods for Average-Reward Markov Potential Games

Min Cheng, Ruida Zhou, P. R. Kumar et al.

We study Markov potential games under the infinite horizon average reward criterion. Most previous studies have been for discounted rewards. We prove that both algorithms based on independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average reward criterion. To set the stage for gradient-based methods, we first establish that the average reward is a smooth function of policies and provide sensitivity bounds for the differential value functions, under certain conditions on ergodicity and the second largest eigenvalue of the underlying Markov decision process (MDP). We prove that three algorithms, policy gradient, proximal-Q, and natural policy gradient (NPG), converge to an $ε$-Nash equilibrium with time complexity $O(\frac{1}{ε^2})$, given a gradient/differential Q function oracle. When policy gradients have to be estimated, we propose an algorithm with $\tilde{O}(\frac{1}{\min_{s,a}π(a|s)δ})$ sample complexity to achieve $δ$ approximation error w.r.t~the $\ell_2$ norm. Equipped with the estimator, we derive the first sample complexity analysis for a policy gradient ascent algorithm, featuring a sample complexity of $\tilde{O}(1/ε^5)$. Simulation studies are presented.

LGMay 4, 2024

Linear Convergence of Independent Natural Policy Gradient in Games with Entropy Regularization

Youbang Sun, Tao Liu, P. R. Kumar et al.

This work focuses on the entropy-regularized independent natural policy gradient (NPG) algorithm in multi-agent reinforcement learning. In this work, agents are assumed to have access to an oracle with exact policy evaluation and seek to maximize their respective independent rewards. Each individual's reward is assumed to depend on the actions of all the agents in the multi-agent system, leading to a game between agents. We assume all agents make decisions under a policy with bounded rationality, which is enforced by the introduction of entropy regularization. In practice, a smaller regularization implies the agents are more rational and behave closer to Nash policies. On the other hand, agents with larger regularization acts more randomly, which ensures more exploration. We show that, under sufficient entropy regularization, the dynamics of this system converge at a linear rate to the quantal response equilibrium (QRE). Although regularization assumptions prevent the QRE from approximating a Nash equilibrium, our findings apply to a wide range of games, including cooperative, potential, and two-player matrix games. We also provide extensive empirical results on multiple games (including Markov games) as a verification of our theoretical analysis.

LGAug 21, 2025

Vector preference-based contextual bandits under distributional shifts

Apurv Shukla, P. R. Kumar

We consider contextual bandit learning under distribution shift when reward vectors are ordered according to a given preference cone. We propose an adaptive-discretization and optimistic elimination based policy that self-tunes to the underlying distribution shift. To measure the performance of this policy, we introduce the notion of preference-based regret which measures the performance of a policy in terms of distance between Pareto fronts. We study the performance of this policy by establishing upper bounds on its regret under various assumptions on the nature of distribution shift. Our regret bounds generalize known results for the existing case of no distribution shift and vectorial reward settings, and scale gracefully with problem parameters in presence of distribution shifts.

SYMay 26, 2023

Finite Time Regret Bounds for Minimum Variance Control of Autoregressive Systems with Exogenous Inputs

Rahul Singh, Akshay Mete, Avik Kar et al.

Minimum variance controllers have been employed in a wide-range of industrial applications. A key challenge experienced by many adaptive controllers is their poor empirical performance in the initial stages of learning. In this paper, we address the problem of initializing them so that they provide acceptable transients, and also provide an accompanying finite-time regret analysis, for adaptive minimum variance control of an auto-regressive system with exogenous inputs (ARX). Following [3], we consider a modified version of the Certainty Equivalence (CE) adaptive controller, which we call PIECE, that utilizes probing inputs for exploration. We show that it has a $C \log T$ bound on the regret after $T$ time-steps for bounded noise, and $C\log^2 T$ in the case of sub-Gaussian noise. The simulation results demonstrate the advantage of PIECE over the algorithm proposed in [3] as well as the standard Certainty Equivalence controller especially in the initial learning phase. To the best of our knowledge, this is the first work that provides finite-time regret bounds for an adaptive minimum variance controller.

OCJan 25, 2022

Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems

Akshay Mete, Rahul Singh, P. R. Kumar

We consider the problem of controlling an unknown stochastic linear system with quadratic costs - called the adaptive LQ control problem. We re-examine an approach called ''Reward Biased Maximum Likelihood Estimate'' (RBMLE) that was proposed more than forty years ago, and which predates the ''Upper Confidence Bound'' (UCB) method as well as the definition of ''regret'' for bandit problems. It simply added a term favoring parameters with larger rewards to the criterion for parameter estimation. We show how the RBMLE and UCB methods can be reconciled, and thereby propose an Augmented RBMLE-UCB algorithm that combines the penalty of the RBMLE method with the constraints of the UCB method, uniting the two approaches to optimism in the face of uncertainty. We establish that theoretically, this method retains $\Tilde{\mathcal{O}}(\sqrt{T})$ regret, the best-known so far. We further compare the empirical performance of the proposed Augmented RBMLE-UCB and the standard RBMLE (without the augmentation) with UCB, Thompson Sampling, Input Perturbation, Randomized Certainty Equivalence and StabL on many real-world examples including flight control of Boeing 747 and Unmanned Aerial Vehicle. We perform extensive simulation studies showing that the Augmented RBMLE consistently outperforms UCB, Thompson Sampling and StabL by a huge margin, while it is marginally better than Input Perturbation and moderately better than Randomized Certainty Equivalence.

LGOct 31, 2021

Policy Optimization for Constrained MDPs with Provable Fast Global Convergence

Tao Liu, Ruida Zhou, Dileep Kalathil et al.

We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm called policy mirror descent-primal dual (PMD-PD) algorithm that can provably achieve a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm utilizes a modified value function and performs natural policy gradient steps, which is equivalent to a mirror descent step with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach to the settings with zero constraint violation and sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy gradient-based algorithms.

LGSep 27, 2021

Learning from Few Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales

Tao Liu, P. R. Kumar, Ruida Zhou et al.

Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.

LGJun 4, 2021

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Tao Liu, Ruida Zhou, Dileep Kalathil et al.

We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.

LGNov 16, 2020

Reward Biased Maximum Likelihood Estimation for Reinforcement Learning

Akshay Mete, Rahul Singh, Xi Liu et al.

The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed to overcome the central obstacle of what is variously called the fundamental "closed-identifiability problem" of adaptive control, the "dual control problem", or, contemporaneously, the "exploration vs. exploitation problem". It exploited the key observation that since the maximum likelihood parameter estimator can asymptotically identify the closed-transition probabilities under a certainty equivalent approach, the limiting parameter estimates must necessarily have an optimal reward that is less than the optimal reward attainable for the true but unknown system. Hence it proposed a counteracting reverse bias in favor of parameters with larger optimal rewards, providing a solution to the fundamental problem alluded to above. It thereby proposed an optimistic approach of favoring parameters with larger optimal rewards, now known as "optimism in the face of uncertainty". The RBMLE approach has been proved to be long-term average reward optimal in a variety of contexts. However, modern attention is focused on the much finer notion of "regret", or finite-time performance. Recent analysis of RBMLE for multi-armed stochastic bandits and linear contextual bandits has shown that it not only has state-of-the-art regret, but it also exhibits empirical performance comparable to or better than the best current contenders, and leads to strikingly simple index policies. Motivated by this, we examine the finite-time performance of RBMLE for reinforcement learning tasks that involve the general problem of optimal control of unknown Markov Decision Processes. We show that it has a regret of $\mathcal{O}( \log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms. Simulation studies show that RBMLE outperforms other algorithms such as UCRL2 and Thompson Sampling.

LGOct 8, 2020

Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits

Yu-Heng Hung, Ping-Chun Hsieh, Xi Liu et al.

Modifying the reward-biased maximum likelihood method originally proposed in the adaptive control literature, we propose novel learning algorithms to handle the explore-exploit trade-off in linear bandits problems as well as generalized linear bandits problems. We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods in extensive experiments. The new policies achieve this with low computation time per pull for linear bandits, and thereby resulting in both favorable regret as well as computational efficiency.

LGMar 21, 2020

Learning in Networked Control Systems

Rahul Singh, P. R. Kumar

We design adaptive controller (learning rule) for a networked control system (NCS) in which data packets containing control information are transmitted across a lossy wireless channel. We propose Upper Confidence Bounds for Networked Control Systems (UCB-NCS), a learning rule that maintains confidence intervals for the estimates of plant parameters $(A_{(\star)},B_{(\star)})$, and channel reliability $p_{(\star)}$, and utilizes the principle of optimism in the face of uncertainty while making control decisions. We provide non-asymptotic performance guarantees for UCB-NCS by analyzing its "regret", i.e., performance gap from the scenario when $(A_{(\star)},B_{(\star)},p_{(\star)})$ are known to the controller. We show that with a high probability the regret can be upper-bounded as $\tilde{O}\left(C\sqrt{T}\right)$\footnote{Here $\tilde{O}$ hides logarithmic factors.}, where $T$ is the operating time horizon of the system, and $C$ is a problem dependent constant.

LGJul 2, 2019

Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits

Xi Liu, Ping-Chun Hsieh, Anirban Bhattacharya et al.

Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the parametric Exponential Family as well as the non-parametric sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To choose the bias-growth rate $α(t)$ in RBMLE, we reveal the nontrivial interplay between $α(t)$ and the regret bound that generally applies in both the Exponential Family as well as the sub-Gaussian/Exponential family bandits. To quantify the finite-time performance, we prove that RBMLE attains order-optimality by adaptively estimating the unknown constants in the expression of $α(t)$ for Gaussian and sub-Gaussian bandits. Extensive experiments demonstrate that the proposed RBMLE achieves empirical regret performance competitive with the state-of-the-art methods, while being more computationally efficient and scalable in comparison to the best-performing ones among them.

LGOct 29, 2018

Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging

Ping-Chun Hsieh, Xi Liu, Anirban Bhattacharya et al.

Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection. In these applications, a "reneging" phenomenon, where participants may disengage from future interactions after observing an unsatisfiable outcome, is rather prevalent. To address the above issue, this paper proposes a model of heteroscedastic linear bandits with reneging, which allows each participant to have a distinct "satisfaction level," with any interaction outcome falling short of that level resulting in that participant reneging. Moreover, it allows the variance of the outcome to be context-dependent. Based on this model, we develop a UCB-type policy, namely HR-UCB, and prove that it achieves $\mathcal{O}\big(\sqrt{{T}(\log({T}))^{3}}\big)$ regret. Finally, we validate the performance of HR-UCB via simulations.

SYOct 14, 2018

Safe Intersection Management for Mixed Transportation Systems with Human-Driven and Autonomous Vehicles

Xi Liu, Ping-Chun Hsieh, P. R. Kumar

Most recent studies on establishing intersection safety focus on the situation where all vehicles are fully autonomous. However, currently most vehicles are human-driven and so we will need to transition through regimes featuring a varying proportion of human-driven vehicles ranging from 100\% to 0\% before realizing such a fully autonomous future -- if ever. We will, therefore, need to address the safety of hybrid systems featuring an arbitrary mixture of human-driven and autonomous vehicles. In fact, recent incidents involving autonomous vehicles have already highlighted the need to study the safety of autonomous vehicles co-existing with human-driven vehicles. Motivated by this we address the design of provably safe intersection management for mixed traffic consisting of a mix of human-driven vehicles (HVs) as well as autonomous vehicles (AVs). To analyze such mixed traffic, we model HVs as nearsighted and with relatively loose constraints, permitting worst-case behavior while AVs are considered as capable of following much tighter constraints. HVs are allowed freedom to change their speed at any time while AVs are only allowed to change their speed beginning of a time slot through a Model Predictive Controller (MPC). AVs are assumed to possess a shorter response time and stronger braking capability than HVs in collision avoidance. Moreover, AVs obtain the permissions of passing through the intersection through vehicle-to-infrastructure (V2I) communication, while HVs achieve the same objective by following traffic lights. Taking the above differences into consideration, we propose a provably safe intersection management for mixed traffic comprised of an MPC-based protocol for AVs along with a coordination protocol for traffic lights...

SYOct 13, 2018

Towards Provably Safe Mixed Transportation Systems with Human-driven and Automated Vehicles

Xi Liu, Ke Ma, P. R. Kumar

Currently, we are in an environment where the fraction of automated vehicles is negligibly small. We anticipate that this fraction will increase in coming decades before if ever, we have a fully automated transportation system. Motivated by this we address the problem of provable safety of mixed traffic consisting of both intelligent vehicles (IVs) as well as human-driven vehicles (HVs). An important issue that arises is that such mixed systems may well have lesser throughput than all human traffic systems if the automated vehicles are expected to remain provably safe with respect to human traffic. This necessitates the consideration of strategies such as platooning of automated vehicles in order to increase the throughput. In this paper, we address the design of provably safe systems consisting of a mix of automated and human-driven vehicles including the use of platooning by automated vehicles. We design motion planing policies and coordination rules for participants in this novel mixed system. HVs are considered as nearsighted and modeled with relatively loose constraints, while IVs are considered as capable of following much tighter constraints. HVs are expected to follow reasonable and simple rules. IVs are designed to move under a model predictive control (MPC) based motion plans and coordination protocols. Our contribution of this paper is in showing how to integrate these two types of models safely into a mixed system. System safety is proved in single lane scenarios, as well as in multi-lane situations allowing lane changes.

ROJan 30, 2018

On the Use of the Observability Gramian for Partially Observed Robotic Path Planning Problems

Mohammadhussein Rafieisakhaei, Suman Chakravorty, P. R. Kumar

Optimizing measures of the observability Gramian as a surrogate for the estimation performance may provide irrelevant or misleading trajectories for planning under observation uncertainty.

NISep 6, 2017

Throughput Optimal Decentralized Scheduling of Multi-Hop Networks with End-to-End Deadline Constraints: II Wireless Networks with Interference

Rahul Singh, P. R. Kumar, Eytan Modiano

Consider a multihop wireless network serving multiple flows in which wireless link interference constraints are described by a link interference graph. For such a network, we design routing-scheduling policies that maximize the end-to-end timely throughput of the network. Timely throughput of a flow $f$ is defined as the average rate at which packets of flow $f$ reach their destination node $d_f$ within their deadline. Our policy has several surprising characteristics. Firstly, we show that the optimal routing-scheduling decision for an individual packet that is present at a wireless node $i\in V$ is solely a function of its location, and "age". Thus, a wireless node $i$ does not require the knowledge of the "global" network state in order to maximize the timely throughput. We notice that in comparison, under the backpressure routing policy, a node $i$ requires only the knowledge of its neighbours queue lengths in order to guarantee maximal stability, and hence is decentralized. The key difference arises due to the fact that in our set-up the packets loose their utility once their "age" has crossed their deadline, thus making the task of optimizing timely throughput much more challenging than that of ensuring network stability. Of course, due to this key difference, the decision process involved in maximizing the timely throughput is also much more complex than that involved in ensuring network-wide queue stabilization. In view of this, our results are somewhat surprising.

ROMay 26, 2017

Near-Optimal Belief Space Planning via T-LQG

Mohammadhussein Rafieisakhaei, Suman Chakravorty, P. R. Kumar

We consider the problem of planning under observation and motion uncertainty for nonlinear robotics systems. Determining the optimal solution to this problem, generally formulated as a Partially Observed Markov Decision Process (POMDP), is computationally intractable. We propose a Trajectory-optimized Linear Quadratic Gaussian (T-LQG) approach that leads to quantifiably near-optimal solutions for the POMDP problem. We provide a novel "separation principle" for the design of an optimal nominal open-loop trajectory followed by an optimal feedback control law, which provides a near-optimal feedback control policy for belief space planning problems involving a polynomial order of calculations of minimum order.

ROMay 24, 2017

A Near-Optimal Separation Principle for Nonlinear Stochastic Systems Arising in Robotic Path Planning and Control

Mohammadhussein Rafieisakhaei, Suman Chakravorty, P. R. Kumar

We consider nonlinear stochastic systems that arise in path planning and control of mobile robots. As is typical of almost all nonlinear stochastic systems, the optimally solving problem is intractable. We provide a design approach which yields a tractable design that is quantifiably near-optimal. We exhibit a "separation" principle under a small noise assumption consisting of the optimal open-loop design of nominal trajectory followed by an optimal feedback law to track this trajectory, which is different from the usual effort of separating estimation from control. As a corollary, we obtain a trajectory-optimized linear quadratic regulator design for stochastic nonlinear systems with Gaussian noise.

ROAug 10, 2016

Belief Space Planning Simplified: Trajectory-Optimized LQG (T-LQG) (Extended Report)

Mohammadhussein Rafieisakhaei, Suman Chakravorty, P. R. Kumar

Planning under motion and observation uncertainties requires solution of a stochastic control problem in the space of feedback policies. In this paper, we reduce the general (n^2+n)-dimensional belief space planning problem to an (n)-dimensional problem by obtaining a Linear Quadratic Gaussian (LQG) design with the best nominal performance. Then, by taking the underlying trajectory of the LQG controller as the decision variable, we pose a coupled design of trajectory, estimator, and controller design through a Non-Linear Program (NLP) that can be solved by a general NLP solver. We prove that under a first-order approximation and a careful usage of the separation principle, our approximations are valid. We give an analysis on the existing major belief space planning methods and show that our algorithm has the lowest computational burden. Finally, we extend our solution to contain general state and control constraints. Our simulation results support our design.

ROMay 5, 2016

Non-Gaussian SLAP: Simultaneous Localization and Planning Under Non-Gaussian Uncertainty in Static and Dynamic Environments

Mohammadhussein Rafieisakhaei, Suman Chakravorty, P. R. Kumar

Simultaneous Localization and Planning (SLAP) under process and measurement uncertainties is a challenge. It involves solving a stochastic control problem modeled as a Partially Observed Markov Decision Process (POMDP) in a general framework. For a convex environment, we propose an optimization-based open-loop optimal control problem coupled with receding horizon control strategy to plan for high quality trajectories along which the uncertainty of the state localization is reduced while the system reaches to a goal state with minimum control effort. In a static environment with non-convex state constraints, the optimization is modified by defining barrier functions to obtain collision-free paths while maintaining the previous goals. By initializing the optimization with trajectories in different homotopy classes and comparing the resultant costs, we improve the quality of the solution in the presence of action and measurement uncertainties. In dynamic environments with time-varying constraints such as moving obstacles or banned areas, the approach is extended to find collision-free trajectories. In this paper, the underlying spaces are continuous, and beliefs are non-Gaussian. Without obstacles, the optimization is a globally convex problem, while in the presence of obstacles it becomes locally convex. We demonstrate the performance of the method on different scenarios.

RONov 16, 2015

Feedback Motion Planning Under Non-Gaussian Uncertainty and Non-Convex State Constraints

Mohammadhussein Rafieisakhaei, Amirhossein Tamjidi, Suman Chakravorty et al.

Planning under process and measurement uncertainties is a challenging problem. In its most general form it can be modeled as a Partially Observed Markov Decision Process (POMDP) problem. However POMDPs are generally difficult to solve when the underlying spaces are continuous, particularly when beliefs are non-Gaussian, and the difficulty is further exacerbated when there are also non-convex constraints on states. Existing algorithms to address such challenging POMDPs are expensive in terms of computation and memory. In this paper, we provide a feedback policy in non-Gaussian belief space via solving a convex program for common non-linear observation models. The solution involves a Receding Horizon Control strategy using particle filters for the non-Gaussian belief representation. We develop a way of capturing non-convex constraints in the state space and adapt the optimization to incorporate such constraints, as well. A key advantage of this method is that it does not introduce additional variables in the optimization problem and is therefore more scalable than existing constrained problems in belief space. We demonstrate the performance of the method on different scenarios.

SYOct 4, 2015

The ISO Problem: Decentralized Stochastic Control via Bidding Schemes

Rahul Singh, P. R. Kumar, Le Xie

We consider a smart-grid connecting several agents, modeled as stochastic dynamical systems, who may be electricity consumers/producers. At each discrete time instant, which may represent a 15 minute interval, each agent may consume/generate some quantity of electrical energy. The Independent System Operator (ISO) is given the task of assigning consumptions/generations to the agents so as to maximize the sum of the utilities accrued to the agents, subject to the constraint that energy generation equals consumption at each time. This task of coordinating generation and demand has to be accomplished by the ISO without the agents revealing their system states, dynamics, or utility/cost functions. We show how and when a simple iterative procedure converges to the optimal solution. The ISO iteratively obtains electricity bids by the agents, and declares the tentative market clearing prices. In response to these prices, the agents submit new bids. On the demand side, the solution yields an optimal demand response for dynamic and stochastic loads. On the generation side, it provides the optimal utilization of stochastically varying renewables such as solar/wind, and generation with fossil fuel based generation with dynamic constraints such as ramping rates. Thereby we solve a decentralized stochastic control problem, without agents sharing any information about their system models, states or utility functions.

SYJul 16, 2015

A Theory for the Operation of the Independent System Operator in a Smart Grid with Stochastic Renewables, Demand Response and Storage

Rahul Singh, Ke Ma, Anupam Thatte et al.

In this paper, we address a key issue of designing architectures and algorithms which generate optimal demand response in a decentralized manner for a smart-grid consisting of several stochastic renewables and dynamic loads. By optimal demand response, we refer to the demand response which maximizes the utility of the agents connected to the smart-grid. By decentralized we refer to the desirable case where neither the independent system operator (ISO) needs to know the dynamics/utilities of the agents, nor do the agents need to have a knowledge of the dynamics/utilities of other agents connected to the grid. The communication between the ISO and agents is restricted to the ISO announcing a pricing policy and the agents responding with their energy generation/consumption bids in response to the pricing policy. We provide a complete solution for both the deterministic and stochastic cases. It features a price iteration scheme that results in optimality of social welfare. We also provide an optimal solution for the case where there is a common randomness affecting and observed by all agents. This solution can be computationally complex, and we pose approximations. For the more general partially observed randomness case, we exhibit a relaxation that significantly reduces complexity. We also provide an approximation strategy that leads to a model predictive control (MPC) approach. Simulation results comparing the resulting optimal demand response with the existing architectures employed by the ISO illustrate the benefit in social welfare utility realized by our scheme. To the best of the authors' knowledge, this is the first work of its kind to explicitly mark out the optimal response of dynamic demand.