LGFeb 7, 2023
Leveraging Demonstrations to Improve Online Learning: Quality MattersBotao Hao, Rahul Jain, Tor Lattimore et al. · stanford
We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.
SYAug 28, 2012
Dynamic Pricing of Power in Smart-Grid NetworksQingsi Wang, Mingyan Liu, Rahul Jain
In this paper we introduce the problem of dynamic pricing of power for smart-grid networks. This is studied within a network utility maximization (NUM) framework in a deterministic setting with a single provider, multiple users and a finite horizon. The provider produces power or buys power in a (deterministic) spot market, and determines a dynamic price to charge the users. The users then adjust their demand in response to the time-varying prices. This is typically categorized as the demand response problem, and we study a progression of related models by focusing on two aspects: 1) the characterization of the structure of the optimal dynamic prices in the Smart Grid and the optimal demand and supply under various interaction with a spot market; 2) a greedy approach to facilitate the solution process of the aggregate NUM problem and the optimality gap between the greedy solution and the optimal one.
PRDec 7, 2014
A queueing model with independent arrivals, and its fluid and diffusion limitsHarsha Honnappa, Rahul Jain, Amy R. Ward
We introduce the Δ(i)/GI/1 queue, a new queueing model. In this model, customers from a given population independently sample a time to arrive from some given distribution F. Thus, the arrival times are an ordered statistics, and the inter-arrival times are differences of consecutive ordered statistics. They are served by a single server which provides service according to a general distribution G, with independent service times. The exact model is analytically intractable. Thus, we develop fluid and diffusion limits for the various stochastic processes, and performance metrics. The fluid limit of the queue length is observed to be a reflected process, while the diffusion limit is observed to be a function of a Brownian motion and a Brownian bridge process, and is given by a 'netput' process and a directional derivative of the Skorokhod reflected fluid netput in the direction of a diffusion refinement of the netput process. We also observe what may be interpreted as a transient Little's law. Sample path analysis reveals various operating regimes where the diffusion limit switches between a free diffusion, a reflected diffusion process and the zero process, with possible discontinuities during regime switches. The weak convergence is established in the M1 topology, and it is also shown that this is not possible in the J1 topology.
GTDec 13, 2011
Strategic Arrivals into Queueing Networks: The Network Concert Queueing GameHarsha Honnappa, Rahul Jain
Queueing networks are typically modelled assuming that the arrival process is exogenous, and unaffected by admission control, scheduling policies, etc. In many situations, however, users choose the time of their arrival strategically, taking delay and other metrics into account. In this paper, we develop a framework to study such strategic arrivals into queueing networks. We start by deriving a functional strong law of large numbers (FSLLN) approximation to the queueing network. In the fluid limit derived, we then study the population game wherein users strategically choose when to arrive, and upon arrival which of the K queues to join. The queues start service at given times, which can potentially be different. We characterize the (strategic) arrival process at each of the queues, and the price of anarchy of the ensuing strategic arrival game. We then extend the analysis to multiple populations of users, each with a different cost metric. The equilibrium arrival profile and price of anarchy are derived. Finally, we present the methodology for exact equilibrium analysis. This, however, is tractable for only some simple cases such as two users arriving at a two node queueing network, which we then present.
RONov 12, 2022
Learning Neuro-symbolic Programs for Language Guided Robot ManipulationNamasivayam Kalithasan, Himanshu Singh, Vishal Bindal et al.
Given a natural language instruction and an input scene, our goal is to train a model to output a manipulation program that can be executed by the robot. Prior approaches for this task possess one of the following limitations: (i) rely on hand-coded symbols for concepts limiting generalization beyond those seen during training [1] (ii) infer action sequences from instructions but require dense sub-goal supervision [2] or (iii) lack semantics required for deeper object-centric reasoning inherent in interpreting complex instructions [3]. In contrast, our approach can handle linguistic as well as perceptual variations, end-to-end trainable and requires no intermediate supervision. The proposed model uses symbolic reasoning constructs that operate on a latent neural object-centric representation, allowing for deeper reasoning over the input scene. Central to our approach is a modular structure consisting of a hierarchical instruction parser and an action simulator to learn disentangled action representations. Our experiments on a simulated environment with a 7-DOF manipulator, consisting of instructions with varying number of steps and scenes with different number of objects, demonstrate that our model is robust to such variations and significantly outperforms baselines, particularly in the generalization settings. The code, dataset and experiment videos are available at https://nsrmp.github.io
LGJan 27, 2023
Safe Posterior Sampling for Constrained MDPs with Bounded Constraint ViolationKrishna C Kalagarla, Rahul Jain, Pierluigi Nuzzo
Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring the constraint is met, or at least the violation is bounded with time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions such as knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds as well as empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our approach is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{ O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action and horizon $H$ CMDP.
CVMay 14Code
MechVerse: Evaluating Physical Motion Consistency in Video Generation ModelsRahul Jain, Mayank Patel, Asim Unmesh et al.
Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, We introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.
LGFeb 2, 2023
ACPO: A Policy Optimization Algorithm for Average MDPs with ConstraintsAkhil Agnihotri, Rahul Jain, Haipeng Luo
Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new policy optimization with function approximation algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by trust region-based policy optimization algorithms. We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging OpenAI Gym environments, show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs.
LGMar 20, 2023
Bridging Imitation and Online Reinforcement Learning: An Optimistic TaleBotao Hao, Rahul Jain, Dengwang Tang et al.
In this paper, we address the following problem: Given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs. We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset, and information about the expert's behavioral policy used to generate the offline dataset. Its cumulative Bayesian regret goes down to zero exponentially fast in N, the offline dataset size if the expert is competent enough. Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm that can be seen as a combination of the RLSVI algorithm for online RL, and imitation learning. Our empirical results show that the proposed iRLSVI algorithm is able to achieve significant reduction in regret as compared to two baselines: no offline data, and offline dataset but used without information about the generative policy. Our algorithm bridges online RL and imitation learning for the first time.
LGAug 24, 2023
Conditional Kernel Imitation Learning for Continuous State EnvironmentsRishabh Agrawal, Nathan Dahlin, Rahul Jain et al.
Imitation Learning (IL) is an important paradigm within the broader reinforcement learning (RL) methodology. Unlike most of RL, it does not assume availability of reward-feedback. Reward inference and shaping are known to be difficult and error-prone methods particularly when the demonstration data comes from human experts. Classical methods such as behavioral cloning and inverse reinforcement learning are highly sensitive to estimation errors, a problem that is particularly acute in continuous state space problems. Meanwhile, state-of-the-art IL algorithms convert behavioral policy learning problems into distribution-matching problems which often require additional online interaction data to be effective. In this paper, we consider the problem of imitation learning in continuous state space environments based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach is based on the Markov balance equation and introduces a novel conditional kernel density estimation-based imitation learning framework. It involves estimating the environment's transition dynamics using conditional kernel density estimators and seeks to satisfy the probabilistic balance equations for the environment. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
LGOct 16, 2023
Posterior Sampling-based Online Learning for Episodic POMDPsDengwang Tang, Dongze Ye, Rahul Jain et al.
Learning in POMDPs is known to be significantly harder than in MDPs. In this paper, we consider the online learning problem for episodic POMDPs with unknown transition and observation models. We propose a Posterior Sampling-based reinforcement learning algorithm for POMDPs (PS4POMDPs), which is much simpler and more implementable compared to state-of-the-art optimism-based online learning algorithms for POMDPs. We show that the Bayesian regret of the proposed algorithm scales as the square root of the number of episodes and is polynomial in the other parameters. In a general setting, the regret scales exponentially in the horizon length $H$, and we show that this is inevitable by providing a lower bound. However, when the POMDP is undercomplete and weakly revealing (a common assumption in the recent literature), we establish a polynomial Bayesian regret bound. We finally propose a posterior sampling algorithm for multi-agent POMDPs, and show it too has sublinear regret.
LGApr 11, 2023
Equivalent and Compact Representations of Neural Network Controllers With Decision TreesKevin Chang, Nathan Dahlin, Rahul Jain et al.
Over the past decade, neural network (NN)-based controllers have demonstrated remarkable efficacy in a variety of decision-making tasks. However, their black-box nature and the risk of unexpected behaviors pose a challenge to their deployment in real-world systems requiring strong guarantees of correctness and safety. We address these limitations by investigating the transformation of NN-based controllers into equivalent soft decision tree (SDT)-based controllers and its impact on verifiability. In contrast to existing work, we focus on discrete-output NN controllers including rectified linear unit (ReLU) activation functions as well as argmax operations. We then devise an exact yet efficient transformation algorithm which automatically prunes redundant branches. We first demonstrate the practical efficacy of the transformation algorithm applied to an autonomous driving NN controller within OpenAI Gym's CarRacing environment. Subsequently, we evaluate our approach using two benchmarks from the OpenAI Gym environment. Our results indicate that the SDT transformation can benefit formal verification, showing runtime improvements of up to $21 \times$ and $2 \times$ for MountainCar-v0 and CartPole-v1, respectively.
LGOct 17, 2023
Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian ApproachDengwang Tang, Rahul Jain, Botao Hao et al.
In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert but with unknown level of competence, i.e., it is not perfect and not necessarily using the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can do substantially better in terms of minimizing cumulative regret, than if it doesn't do that. We establish an upper bound on regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite horizon setting. We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
AIApr 10, 2023
A Novel Point-based Algorithm for Multi-agent Control Using the Common Information ApproachDengwang Tang, Ashutosh Nayyar, Rahul Jain
The Common Information (CI) approach provides a systematic way to transform a multi-agent stochastic control problem to a single-agent partially observed Markov decision problem (POMDP) called the coordinator's POMDP. However, such a POMDP can be hard to solve due to its extraordinarily large action space. We propose a new algorithm for multi-agent stochastic control problems, called coordinator's heuristic search value iteration (CHSVI), that combines the CI approach and point-based POMDP algorithms for large action spaces. We demonstrate the algorithm through optimally solving several benchmark problems.
CVMar 19
Narrative Aligned Long Form Video Question AnsweringRahul Jain, Keval Doshi, Burak Uzkent et al.
Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
LGMay 16
When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models RevisitedRishabh Agrawal, Rahul Jain, Ashutosh Nayyar
Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policy can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.
LGAug 17, 2024
Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation LearningRishabh Agrawal, Nathan Dahlin, Rahul Jain et al.
Imitation learning (IL) is notably effective for robotic tasks where directly programming behaviors or defining optimal control costs is challenging. In this work, we address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning. It does not have additional supplementary datasets beyond the expert's dataset nor any information about the transition dynamics. Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting. Our method uses the Markov balance equation and introduces a novel conditional density estimation-based imitation learning framework. It employs conditional normalizing flows for transition dynamics estimation and aims at satisfying a balance equation for the environment. Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.
LGOct 15, 2023
Gender-Based Comparative Study of Type 2 Diabetes Risk Factors in Kolkata, India: A Machine Learning ApproachRahul Jain, Anoushka Saha, Gourav Daga et al.
Type 2 diabetes mellitus represents a prevalent and widespread global health concern, necessitating a comprehensive assessment of its risk factors. This study aimed towards learning whether there is any differential impact of age, Lifestyle, BMI and Waist to height ratio on the risk of Type 2 diabetes mellitus in males and females in Kolkata, West Bengal, India based on a sample observed from the out-patient consultation department of Belle Vue Clinic in Kolkata. Various machine learning models like Logistic Regression, Random Forest, and Support Vector Classifier, were used to predict the risk of diabetes, and performance was compared based on different predictors. Our findings indicate a significant age-related increase in risk of diabetes for both males and females. Although exercising and BMI was found to have significant impact on the risk of Type 2 diabetes in males, in females both turned out to be statistically insignificant. For both males and females, predictive models based on WhtR demonstrated superior performance in risk assessment compared to those based on BMI. This study sheds light on the gender-specific differences in the risk factors for Type 2 diabetes, offering valuable insights that can be used towards more targeted healthcare interventions and public health strategies.
CVFeb 24
Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action SegmentationAsim Unmesh, Kaki Ramesh, Mayank Patel et al.
Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
LGMar 21
Bayesian Learning in Episodic Zero-Sum GamesChang-Wei Yueh, Andy Zhao, Ashutosh Nayyar et al.
We study Bayesian learning in episodic, finite-horizon zero-sum Markov games with unknown transition and reward models. We investigate a posterior algorithm in which each player maintains a Bayesian posterior over the game model, independently samples a game model at the beginning of each episode, and computes an equilibrium policy for the sampled model. We analyze two settings: (i) Both players use the posterior sampling algorithm, and (ii) Only one player uses posterior sampling while the opponent follows an arbitrary learning algorithm. In each setting, we provide guarantees on the expected regret of the posterior sampling agent. Our notion of regret compares the expected total reward of the learning agent against the expected total reward under equilibrium policies of the true game. Our main theoretical result is an expected regret bound for the posterior sampling agent of order $O(HS\sqrt{ABHK\log(SABHK)})$ where $K$ is the number of episodes, $H$ is the episode length, $S$ is the number of states, and $A,B$ are the action space sizes of the two players. Experiments in a grid-world predator--prey domain illustrate the sublinear regret scaling and show that posterior sampling competes favorably with a fictitious-play baseline.
LGNov 11, 2025
Balance Equation-based Distributionally Robust Offline Imitation LearningRishabh Agrawal, Yusuf Alvi, Rahul Jain et al.
Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. However, standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. In practice, this assumption rarely holds where modeling inaccuracies, real-world parameter variations, and adversarial perturbations can all induce shifts in transition dynamics, leading to severe performance degradation. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that learns robust policies solely from expert demonstrations collected under nominal dynamics, without requiring further environment interaction. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution. Importantly, we show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. Empirical evaluations on continuous-control benchmarks demonstrate that our approach achieves superior robustness and generalization compared to state-of-the-art offline IL baselines, particularly under perturbed or shifted environments.
LGFeb 4, 2025
Robust LLM Alignment via Distributionally Robust Direct Preference OptimizationZaiyan Xu, Sushil Vemuri, Kishan Panaganti et al.
A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
SPApr 14
Applied AI-Enhanced RF Interference RejectionRahul Jain, Pierre Trepagnier, Rick Gentile et al.
AI-enhanced interference rejection in radio frequency (RF) transmissions has recently attracted interest because deep learning approaches trained on both the signal of interest (SOI) and the signal mixture (SOI plus interference) can outperform traditional approaches which only consider the SOI. The goal is to detect, demodulate, and decode signals over a range of signal-to-interference-plus-noise (SINR) levels without having a detailed, design-level knowledge of the interfering signal or the propagation conditions. Our present AI interference suppression results are based on Autoregressive Transformer Decoder models which exhibit orders of magnitude faster throughput at inference time than WaveNet models developed in earlier work. As a specific example, we investigate an analog FM "Walkie Talkie" radio signal of interest in the presence of an Orthogonal Frequency-Division Multiplexing (OFDM) interferer. This type of interferer is near-ubiquitous in the current RF landscape. Our results clearly show the benefits of transformer-based interference mitigation in tactical settings. We show that unintelligible transmissions become intelligible via metrics such as Perceptual Evaluation of Speech Quality (PESQ), while overall latency is kept to a minimum using readily available lightweight GPUs such as a Jetson AGX Orin. We believe these same techniques can also be applied to a broader set of national security scenarios, as well as having commercial applications.
LGMay 23, 2024
Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed BudgetDengwang Tang, Rahul Jain, Ashutosh Nayyar et al.
In this paper, we introduce the constrained best mixed arm identification (CBMAI) problem with a fixed budget. This is a pure exploration problem in a stochastic finite armed bandit model. Each arm is associated with a reward and multiple types of costs from unknown distributions. Unlike the unconstrained best arm identification problem, the optimal solution for the CBMAI problem may be a randomized mixture of multiple arms. The goal thus is to find the best mixed arm that maximizes the expected reward subject to constraints on the expected costs with a given learning budget $N$. We propose a novel, parameter-free algorithm, called the Score Function-based Successive Reject (SFSR) algorithm, that combines the classical successive reject framework with a novel score-function-based rejection criteria based on linear programming theory to identify the optimal support. We provide a theoretical upper bound on the mis-identification (of the the support of the best mixed arm) probability and show that it decays exponentially in the budget $N$ and some constants that characterize the hardness of the problem instance. We also develop an information theoretic lower bound on the error probability that shows that these constants appropriately characterize the problem difficulty. We validate this empirically on a number of average and hard instances.
LGJan 29
Model-Free Neural State Estimation in Nonlinear Dynamical Systems: A Comparative Study of Neural Architectures and Classical FiltersZhuochen Liu, Hans Walker, Rahul Jain
Neural network models are increasingly used for state estimation in control and decision-making problems, yet it remains unclear to what extent they behave as principled filters in nonlinear dynamical systems. Unlike classical filters, which rely on explicit knowledge of system dynamics and noise models, neural estimators can be trained purely from data without access to the underlying system equations. In this work, we present a systematic empirical comparison between such model-free neural network models and classical filtering methods across multiple nonlinear scenarios. Our study evaluates Transformer-based models, state-space neural networks, and recurrent architectures alongside particle filters and nonlinear Kalman filters. The results show that neural models (in particular, state-space models (SSMs)) achieve state estimation performance that approaches strong nonlinear Kalman filters in nonlinear scenarios and outperform weaker classical baselines despite lacking access to system models, while also attaining substantially higher inference throughput.
QUANT-PHMar 9
Quantum information advantage based on Bell inequalitiesRahul Jain, Srijita Kundu
Recently, Kretschmer et al. [KGD+25] presented an experimental demonstration of a proposed quantum information advantage protocol. We present an alternate proposal based on a relation derived from parallel-repeated CHSH games. Our memory measure is based on an information measure and is different from [KGD+25], where they count the number of qubits. Our proposal has an efficient verifier and a noise-robust quantum prover which is arguably much more efficient compared to [KGD+25].
ROSep 26, 2025
Self-driving cars: Are we there yet?Merve Atasever, Zhuochen Liu, Qingpei Li et al.
Autonomous driving remains a highly active research domain that seeks to enable vehicles to perceive dynamic environments, predict the future trajectories of traffic agents such as vehicles, pedestrians, and cyclists and plan safe and efficient future motions. To advance the field, several competitive platforms and benchmarks have been established to provide standardized datasets and evaluation protocols. Among these, leaderboards by the CARLA organization and nuPlan and the Waymo Open Dataset have become leading benchmarks for assessing motion planning algorithms. Each offers a unique dataset and challenging planning problems spanning a wide range of driving scenarios and conditions. In this study, we present a comprehensive comparative analysis of the motion planning methods featured on these three leaderboards. To ensure a fair and unified evaluation, we adopt CARLA leaderboard v2.0 as our common evaluation platform and modify the selected models for compatibility. By highlighting the strengths and weaknesses of current approaches, we identify prevailing trends, common challenges, and suggest potential directions for advancing motion planning research.
CVSep 15, 2025
DYNAMO: Dependency-Aware Deep Learning Framework for Articulated Assembly Motion PredictionMayank Patel, Rahul Jain, Asim Unmesh et al.
Understanding the motion of articulated mechanical assemblies from static geometry remains a core challenge in 3D perception and design automation. Prior work on everyday articulated objects such as doors and laptops typically assumes simplified kinematic structures or relies on joint annotations. However, in mechanical assemblies like gears, motion arises from geometric coupling, through meshing teeth or aligned axes, making it difficult for existing methods to reason about relational motion from geometry alone. To address this gap, we introduce MechBench, a benchmark dataset of 693 diverse synthetic gear assemblies with part-wise ground-truth motion trajectories. MechBench provides a structured setting to study coupled motion, where part dynamics are induced by contact and transmission rather than predefined joints. Building on this, we propose DYNAMO, a dependency-aware neural model that predicts per-part SE(3) motion trajectories directly from segmented CAD point clouds. Experiments show that DYNAMO outperforms strong baselines, achieving accurate and temporally consistent predictions across varied gear configurations. Together, MechBench and DYNAMO establish a novel systematic framework for data-driven learning of coupled mechanical motion in CAD assemblies.
LGMay 25, 2025
Reduce Computational Cost In Deep Reinforcement Learning Via Randomized Policy LearningZhuochen Liu, Rahul Jain, Quan Nguyen
Recent advancements in reinforcement learning (RL) have leveraged neural networks to achieve state-of-the-art performance across various control tasks. However, these successes often come at the cost of significant computational resources, as training deep neural networks requires substantial time and data. In this paper, we introduce an actor-critic algorithm that utilizes randomized neural networks to drastically reduce computational costs while maintaining strong performance. Despite its simple architecture, our method effectively solves a range of control problems, including the locomotion control of a highly dynamic 12-motor quadruped robot, and achieves results comparable to leading algorithms such as Proximal Policy Optimization (PPO). Notably, our approach does not outperform other algorithms in terms of sample efficnency but rather in terms of wall-clock training time. That is, although our algorithm requires more timesteps to converge to an optimal policy, the actual time required for training turns out to be lower.
LGJan 31, 2025
Best Policy Learning from Trajectory Preference FeedbackAkhil Agnihotri, Rahul Jain, Deepak Ramachandran et al.
Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset--potentially biased or out-of-distribution and collected from a rater of subpar 'competence'--with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.
LGJun 13, 2024
Online Bandit Learning with Offline Preference Data for Improved RLHFAkhil Agnihotri, Rahul Jain, Deepak Ramachandran et al.
Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores since the latter tends to be noisy. On the other hand, RL theory and algorithms predominantly assume that a reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We consider an offline preference dataset to be available generated by an expert of unknown 'competence'. We propose warmPref-PS, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively. We support our claims with novel theoretical analysis of its Bayesian regret, as well as, extensive empirical evaluation of an approximate loss function that optimizes for infinitely many arms, and performs substantially better than baselines.
LGJun 13, 2024
e-COP : Episodic Constrained Optimization of PoliciesAkhil Agnihotri, Rahul Jain, Deepak Ramachandran et al.
In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system's behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the $\texttt{e-COP}$ algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models.
AIMay 24, 2023
Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision ProcessesKrishna C. Kalagarla, Dhruva Kartik, Dongming Shen et al.
Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.
CRFeb 27, 2022
Quantum secure non-malleable codes in the split-state modelDivesh Aggarwal, Naresh Goud Boddu, Rahul Jain
Non-malleable-codes introduced by Dziembowski, Pietrzak and Wichs [DPW18] encode a classical message $S$ in a manner such that tampering the codeword results in the decoder either outputting the original message $S$ or a message that is unrelated/independent of $S$. Providing such non-malleable security for various tampering function families has received significant attention in recent years. We consider the well-studied (2-part) split-state model, in which the message $S$ is encoded into two parts $X$ and $Y$, and the adversary is allowed to arbitrarily tamper with each $X$ and $Y$ individually. We consider the security of non-malleable-codes in the split-state model when the adversary is allowed to make use of arbitrary entanglement to tamper the parts $X$ and $Y$. We construct explicit quantum secure non-malleable-codes in the split-state model. Our construction of quantum secure non-malleable-codes is based on the recent construction of quantum secure $2$-source non-malleable-extractors by Boddu, Jain and Kapshikar [BJK21].
LGJan 31, 2022
Learning Infinite-Horizon Average-Reward Markov Decision Processes with ConstraintsLiyu Chen, Rahul Jain, Haipeng Luo
We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $\widetilde{O}(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of (Singh et al., 2020), whose regret and constraint violation are both $\widetilde{O}(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $\widetilde{O}(T^{2/3})$ regret and constraint violation, which can be further improved to $\widetilde{O}(\sqrt{T})$ via a simple modification, albeit making the algorithm computationally inefficient. As far as we know, these are the first set of provable algorithms for weakly communicating MDPs with cost constraints.
LGDec 18, 2021
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDPLiyu Chen, Rahul Jain, Haipeng Luo
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound $\widetilde{O}\left(\sqrt{d^3B_{\star}^2T_{\star} K}\right)$, where $d$ is the dimension of the feature space, $B_{\star}$ and $T_{\star}$ are upper bounds of the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order $O\left(\frac{d^3B_{\star}^4}{c_{\min}^2\text{gap}_{\min}}\ln^5\frac{dB_{\star} K}{c_{\min}} \right)$, where $\text{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\widetilde{O}(d^{3.5}B_{\star}\sqrt{K})$ with no polynomial dependency on $T_{\star}$ or $1/c_{\min}$, almost matching the $Ω(dB_{\star}\sqrt{K})$ lower bound from (Min et al., 2021).
SYSep 27, 2021
Model-Free Reinforcement Learning for Optimal Control of MarkovDecision Processes Under Signal Temporal Logic SpecificationsKrishna C. Kalagarla, Rahul Jain, Pierluigi Nuzzo
We present a model-free reinforcement learning algorithm to find an optimal policy for a finite-horizon Markov decision process while guaranteeing a desired lower bound on the probability of satisfying a signal temporal logic (STL) specification. We propose a method to effectively augment the MDP state space to capture the required state history and express the STL objective as a reachability objective. The planning problem can then be formulated as a finite-horizon constrained Markov decision process (CMDP). For a general finite horizon CMDP problem with unknown transition probability, we develop a reinforcement learning scheme that can leverage any model-free RL algorithm to provide an approximately optimal policy out of the general space of non-stationary randomized policies. We illustrate the effectiveness of our approach in the context of robotic motion planning for complex missions under uncertainty and performance objectives.
LGSep 8, 2021
A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with an Arbitrary OpponentMehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar
In this paper, we propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that achieves Bayesian regret bound of $O(HS\sqrt{AT})$ in the infinite-horizon zero-sum stochastic games with average-reward criterion. Here $H$ is an upper bound on the span of the bias function, $S$ is the number of states, $A$ is the number of joint actions and $T$ is the horizon. We consider the online setting where the opponent can not be controlled and can take any arbitrary time-adaptive history-dependent strategy. Our regret bound improves on the best existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the same assumption and matches the theoretical lower bound in $T$.
LGSep 7, 2021
Online Learning for Cooperative Multi-Player Multi-Armed BanditsWilliam Chang, Mehdi Jafarnia-Jahromi, Rahul Jain
We introduce a framework for decentralized online learning for multi-armed bandits (MAB) with multiple cooperative players. The reward obtained by the players in each round depends on the actions taken by all the players. It's a team setting, and the objective is common. Information asymmetry is what makes the problem interesting and challenging. We consider three types of information asymmetry: action information asymmetry when the actions of the players can't be observed but the rewards received are common; reward information asymmetry when the actions of the other players are observable but rewards received are IID from the same distribution; and when we have both action and reward information asymmetry. For the first setting, we propose a UCB-inspired algorithm that achieves $O(\log T)$ regret whether the rewards are IID or Markovian. For the second section, we offer an environment such that the algorithm given for the first setting gives linear regret. For the third setting, we show that a variation of the `explore then commit' algorithm achieves almost log regret.
CRSep 7, 2021
Quantum secure non-malleable-extractorsNaresh Goud Boddu, Rahul Jain, Upendra Kapshikar
We construct several explicit quantum secure non-malleable-extractors. All the quantum secure non-malleable-extractors we construct are based on the constructions by Chattopadhyay, Goyal and Li [2015] and Cohen [2015]. 1) We construct the first explicit quantum secure non-malleable-extractor for (source) min-entropy $k \geq \textsf{poly}\left(\log \left( \frac{n}ε \right)\right)$ ($n$ is the length of the source and $ε$ is the error parameter). Previously Aggarwal, Chung, Lin, and Vidick [2019] have shown that the inner-product based non-malleable-extractor proposed by Li [2012] is quantum secure, however it required linear (in $n$) min-entropy and seed length. Using the connection between non-malleable-extractors and privacy amplification (established first in the quantum setting by Cohen and Vidick [2017]), we get a $2$-round privacy amplification protocol that is secure against active quantum adversaries with communication $\textsf{poly}\left(\log \left( \frac{n}ε \right)\right)$, exponentially improving upon the linear communication required by the protocol due to [2019]. 2) We construct an explicit quantum secure $2$-source non-malleable-extractor for min-entropy $k \geq n- n^{Ω(1)}$, with an output of size $n^{Ω(1)}$ and error $2^{- n^{Ω(1)}}$. 3) We also study their natural extensions when the tampering of the inputs is performed $t$-times. We construct explicit quantum secure $t$-non-malleable-extractors for both seeded ($t=d^{Ω(1)}$) as well as $2$-source case ($t=n^{Ω(1)}$).
LGJun 15, 2021
Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest PathLiyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain et al.
We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured. The key of our analysis is a new technique called implicit finite-horizon approximation, which approximates the SSP model by a finite-horizon counterpart only in the analysis without explicit implementation. Using this template, we develop two new algorithms: the first one is model-free (the first in the literature to our knowledge) and minimax optimal under strictly positive costs; the second one is model-based and minimax optimal even with zero-cost state-action pairs, matching the best existing result from [Tarbouriech et al., 2021b]. Importantly, both algorithms admit highly sparse updates, making them computationally more efficient than all existing algorithms. Moreover, both can be made completely parameter-free.
LGJun 9, 2021
Online Learning for Stochastic Shortest Path Model via Posterior SamplingMehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain et al.
We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution on the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch completes if either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to any of the state-action pairs is doubled. We establish a Bayesian regret bound of $O(B_\star S\sqrt{AK})$, where $B_\star$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm only requires the knowledge of the prior distribution, and has no hyper-parameters to tune. It is the first such posterior sampling algorithm and outperforms numerically previously proposed optimism-based algorithms.
CRJun 5, 2021
Quantum Measurement AdversaryDivesh Aggarwal, Naresh Goud Boddu, Rahul Jain et al.
Multi-source-extractors are functions that extract uniform randomness from multiple (weak) sources of randomness. Quantum multi-source-extractors were considered by Kasher and Kempe (for the quantum-independent-adversary and the quantum-bounded-storage-adversary), Chung, Li and Wu (for the general-entangled-adversary) and Arnon-Friedman, Portmann and Scholz (for the quantum-Markov-adversary). One of the main objectives of this work is to unify all the existing quantum multi-source adversary models. We propose two new models of adversaries: 1) the quantum-measurement-adversary (qm-adv), which generates side-information using entanglement and on post-measurement and 2) the quantum-communication-adversary (qc-adv), which generates side-information using entanglement and communication between multiple sources. We show that, 1. qm-adv is the strongest adversary among all the known adversaries, in the sense that the side-information of all other adversaries can be generated by qm-adv. 2. The (generalized) inner-product function (in fact a general class of two-wise independent functions) continues to work as a good extractor against qm-adv with matching parameters as that of Chor and Goldreich. 3. A non-malleable-extractor proposed by Li (against classical-adversaries) continues to be secure against quantum side-information. This result implies a non-malleable-extractor result of Aggarwal, Chung, Lin and Vidick with uniform seed. We strengthen their result via a completely different proof to make the non-malleable-extractor of Li secure against quantum side-information even when the seed is not uniform. 4. A modification (working with weak sources instead of uniform sources) of the Dodis and Wichs protocol for privacy-amplification is secure against active quantum adversaries. This strengthens on a recent result due to Aggarwal, Chung, Lin and Vidick which uses uniform sources.
LGFeb 25, 2021
Online Learning for Unknown Partially Observable MDPsMehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar
Solving Partially Observable Markov Decision Processes (POMDPs) is hard. Learning optimal controllers for POMDPs when the model is unknown is harder. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively tradeoff exploration and exploitation, is even harder, and no solution exists currently. In this paper, we consider infinite-horizon average-cost POMDPs with unknown transition model, though a known observation model. We propose a natural posterior sampling-based reinforcement learning algorithm (PSRL-POMDP) and show that it achieves a regret bound of $O(\log T)$, where $T$ is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves $O (T^{2/3})$ regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs and has sub-linear regret.
LGOct 28, 2020
Designing Interpretable Approximations to Deep Reinforcement LearningNathan Dahlin, Krishna Chaitanya Kalagarla, Nikhil Naik et al.
In an ever expanding set of research and application areas, deep neural networks (DNNs) set the bar for algorithm performance. However, depending upon additional constraints such as processing power and execution time limits, or requirements such as verifiable safety guarantees, it may not be feasible to actually use such high-performing DNNs in practice. Many techniques have been developed in recent years to compress or distill complex DNNs into smaller, faster or more understandable models and controllers. This work seeks to identify reduced models that not only preserve a desired performance level, but also, for example, succinctly explain the latent knowledge represented by a DNN. We illustrate the effectiveness of the proposed approach on the evaluation of decision tree variants and kernel machines in the context of benchmark reinforcement learning tasks.
LGSep 23, 2020
A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with ConstraintsKrishna C. Kalagarla, Rahul Jain, Pierluigi Nuzzo
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $ε$-optimal policy, i.e., with resulting objective value within $ε$ of the optimal value and satisfying the constraints within $ε$-tolerance, with probability at least $1-δ$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{ε^{2}}\log\frac{1}δ\big)$, where $C$ is the upper bound on the number of possible successor states for a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed have a linear dependence on the state and action space sizes $|S|$ and $|A|$, respectively, and quadratic dependence on the time horizon $H$.
LGJul 23, 2020
Learning Infinite-horizon Average-reward MDPs with Linear Function ApproximationChen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo et al.
We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
LGJun 8, 2020
A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal RegretMehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain et al.
Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt T)$ regret without the ergodic assumption, and matches the lower bound in terms of $T$ except for logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
LGJun 8, 2020
Randomized Policy Learning for Continuous State and Action MDPsHiteshi Sharma, Rahul Jain
Deep reinforcement learning methods have achieved state-of-the-art results in a variety of challenging, high-dimensional domains ranging from video games to locomotion. The key to success has been the use of deep neural networks used to approximate the policy and value function. Yet, substantial tuning of weights is required for good results. We instead use randomized function approximation. Such networks are not only cheaper than training fully connected networks but also improve the numerical performance. We present \texttt{RANDPOL}, a generalized policy iteration algorithm for MDPs with continuous state and action spaces. Both the policy and value functions are represented with randomized networks. We also give finite time guarantees on the performance of the algorithm. Then we show the numerical performance on challenging environments and compare them with deep neural network based algorithms.
LGOct 15, 2019
Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision ProcessesChen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo et al.
Model-free reinforcement learning is known to be memory and computation efficient and more amendable to large scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves $\mathcal{O}(T^{2/3})$ regret after $T$ steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to $\mathcal{O}(\sqrt{T})$, albeit with a stronger ergodic assumption. This result significantly improves over the $\mathcal{O}(T^{3/4})$ regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.