Ziniu Li

h-index: 3

Papers: 3

Citations: 13

Novelty: 40%

AI Score: 21

Ranked #183,438 of 194,257 authors (top 94%); #39,072 in LG (top 97%)

3 Papers

3.8 | LG | Mar 13, 2023

Deploying Offline Reinforcement Learning with Human Feedback

Ziniu Li, Ke Xu, Liu Liu et al.

Reinforcement learning (RL) has shown promise for decision-making tasks in real-world applications. One practical framework involves training parameterized policy models from an offline dataset and subsequently deploying them in an online environment. However, this approach can be risky since the offline training may not be perfect, leading to poor performance of the RL models that may take dangerous actions. To address this issue, we propose an alternative framework that involves a human supervising the RL models and providing additional feedback in the online deployment phase. We formalize this online deployment problem and develop two approaches. The first approach uses model selection and the upper confidence bound algorithm to adaptively select a model to deploy from a candidate set of trained offline RL models. The second approach involves fine-tuning the model in the online deployment phase when a supervision signal arrives. We demonstrate the effectiveness of these approaches for robot locomotion control and traffic light control tasks through empirical validation.
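
The first approach in the abstract above (adaptive model selection with an upper confidence bound rule) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard UCB1 score, and the names `ucb_select`, `deploy`, and `candidates`, as well as the Bernoulli "human approval" feedback, are hypothetical stand-ins for trained offline RL policies and real supervision signals.

```python
import math
import random


def ucb_select(counts, means, t, c=2.0):
    """Pick the candidate index with the highest UCB1 score at round t."""
    # Deploy every candidate once before trusting any estimate.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def deploy(feedback_fns, rounds=2000, seed=0):
    """Repeatedly choose a candidate, observe feedback, update its running mean."""
    rng = random.Random(seed)
    k = len(feedback_fns)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, rounds + 1):
        i = ucb_select(counts, means, t)
        reward = feedback_fns[i](rng)  # assumed feedback in [0, 1]
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental average
    return counts, means


# Hypothetical candidates: each "policy" is reduced to an approval probability.
candidates = [
    (lambda rng, p=p: 1.0 if rng.random() < p else 0.0) for p in (0.2, 0.5, 0.8)
]
counts, means = deploy(candidates)
```

In this toy setting the selection rule concentrates deployments on the candidate with the highest approval rate while still occasionally re-checking the others.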

7.8 | LG | Mar 22, 2022

A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle

Ziniu Li, Tian Xu, Yang Yu

Q-learning with function approximation can diverge in the off-policy setting, and the target network is a powerful technique to address this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-\gamma)^{-5}\varepsilon^{-2})$. Furthermore, we show that this sample complexity improves to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-5}\varepsilon^{-2})$ if we can sequentially update all state-action pairs, and to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-\gamma)^{-4}\varepsilon^{-2})$ if, in addition, $\gamma \in (1/2, 1)$. Compared with vanilla Q-learning, our results show that introducing a periodically frozen target Q-function does not sacrifice sample complexity.
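
The setting described in this abstract (tabular Q-learning with a periodically frozen target and a generative oracle that can sample a next state for any state-action pair) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm analyzed in the paper: the function name `target_q_learning`, the `1/k` averaging step size, and the loop lengths `outer` and `inner` are choices made here for clarity.

```python
import numpy as np


def target_q_learning(P, R, gamma=0.9, outer=200, inner=50, seed=0):
    """Tabular Q-learning with a periodically frozen target Q-function.

    P has shape (S, A, S) and holds transition probabilities; sampling from
    P[s, a] plays the role of the generative oracle. R has shape (S, A).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(outer):
        q_target = q.copy()  # freeze the target for the whole inner loop
        for k in range(1, inner + 1):
            lr = 1.0 / k  # assumed step size: running average of sampled backups
            # Sweep all state-action pairs in sequence.
            for s in range(n_states):
                for a in range(n_actions):
                    s_next = rng.choice(n_states, p=P[s, a])  # oracle query
                    backup = R[s, a] + gamma * q_target[s_next].max()
                    q[s, a] += lr * (backup - q[s, a])
    return q


# Small hypothetical MDP with 2 states and 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 0.1],
              [0.0, 1.0]])
q_estimate = target_q_learning(P, R)
```

With the `1/k` step size, each inner loop averages sampled Bellman backups computed against the frozen target, so one outer iteration approximates a single application of the Bellman optimality operator.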

4.8 | LG | Nov 16, 2019

On Value Discrepancy of Imitation Learning

Tian Xu, Ziniu Li, Yang Yu

Imitation learning trains a policy from expert demonstrations. Imitation learning approaches have been designed from various principles, such as behavioral cloning via supervised learning, apprenticeship learning via inverse reinforcement learning, and GAIL via generative adversarial learning. In this paper, we propose a framework to analyze the theoretical properties of imitation learning approaches based on discrepancy propagation analysis. Under the infinite-horizon setting, the framework yields a value discrepancy for behavioral cloning of order $\mathcal O((1-\gamma)^{-2})$. We also show that the framework yields a value discrepancy for GAIL of order $\mathcal O((1-\gamma)^{-1})$. This implies that GAIL suffers less from compounding errors than behavioral cloning, which is also verified empirically in this paper. To the best of our knowledge, we are the first to analyze GAIL's performance theoretically. These results indicate that the proposed framework is a general tool for analyzing imitation learning approaches. We hope our theoretical results can provide insights for future improvements in imitation learning algorithms.