LGJan 9, 2023
Minimax Weight Learning for Absorbing MDPsFengyin Li, Yuqiang Li, Xianyi Wu
Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon MDPs. In this paper, we study undiscounted off-policy policy evaluation for absorbing MDPs. Given the dataset consisting of the i.i.d episodes with a given truncation level, we propose a so-called MWLA algorithm to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound for the MWLA method is investigated and the dependence of statistical errors on the data size and the truncation level are analyzed. With an episodic taxi environment, computational experiments illustrate the performance of the MWLA algorithm.
MLMay 15
Pessimistic Risk-Aware Policy Learning in Contextual BanditsYilong Wan, Yuqiang Li, Xianyi Wu
We study risk-aware offline policy learning, aiming to learn a decision rule from logged data that is optimal under general risk criteria. This problem is crucial in high-stakes domains where online interaction is infeasible and adverse outcomes must be carefully controlled. However, existing literature on offline contextual bandits either centers on expected-reward criteria or restricts risk considerations to policy evaluation instead of optimization. In this work, we propose a unified distributional framework for optimizing Lipschitz-continuous risk functionals, a broad class of risk measures encompassing mean-variance, entropic risk, and conditional value-at-risk, among others. By developing novel empirical concentration inequalities for importance sampling-based distributional estimators, our analysis derives data-dependent suboptimality bounds with an $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate, without relying on restrictive uniform overlap assumptions. This rate is minimax optimal and matches that of risk-neutral offline policy optimization, indicating that optimizing general Lipschitz risk criteria incurs no additional statistical cost relative to the expected-reward.
LGMay 8
Improved Model-based Reinforcement Learning with Smooth KernelsKun Long, Yuqiang Li, Xianyi Wu
For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-based approaches offer a promising alternative paradigm that instead leverages the smoothness of the MDP and employs non-parametric kernel smoothing estimates of transition dynamics. This paper proposes a new kernel-smoothing model-based approach for online reinforcement learning in finite-horizon settings under Lipschitz continuity assumptions on the MDP. By incorporating a Bernstein-style exploration bonus into the kernel smoothing framework, our method achieves a regret bound which improves upon the state-of-the-art regret bound in its dependence on the horizon. The theoretical advancement relies on a delicate analysis of the synergy between Bernstein-style bonuses and kernel smoothing, where a new tight Bernstein-type concentration inequality for martingales may be of independent interest.
MLJul 22, 2025
PAC Off-Policy Prediction of Contextual BanditsYilong Wan, Yuqiang Li, Xianyi Wu
This paper investigates off-policy evaluation in contextual bandits, aiming to quantify the performance of a target policy using data collected under a different and potentially unknown behavior policy. Recently, methods based on conformal prediction have been developed to construct reliable prediction intervals that guarantee marginal coverage in finite samples, making them particularly suited for safety-critical applications. To further achieve coverage conditional on a given offline data set, we propose a novel algorithm that constructs probably approximately correct prediction intervals. Our method builds upon a PAC-valid conformal prediction framework, and we strengthen its theoretical guarantees by establishing PAC-type bounds on coverage. We analyze both finite-sample and asymptotic properties of the proposed method, and compare its empirical performance with existing methods in simulations.
PRAug 20, 2018
A General Framework of Multi-Armed Bandit Processes by Arm Switch RestrictionsWenqing Bao, Xiaoqiang Cai, Xianyi Wu
This paper proposes a general framework of multi-armed bandit (MAB) processes by introducing a type of restrictions on the switches among arms evolving in continuous time. The Gittins index process is constructed for any single arm subject to the restrictions on switches and then the optimality of the corresponding Gittins index rule is established. The Gittins indices defined in this paper are consistent with the ones for MAB processes in continuous time, integer time, semi-Markovian setting as well as general discrete time setting, so that the new theory covers the classical models as special cases and also applies to many other situations that have not yet been touched in the literature. While the proof of the optimality of Gittins index policies benefits from ideas in the existing theory of MAB processes in continuous time, new techniques are introduced which drastically simplify the proof.