Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds
This work addresses the challenge of making RL more practical by providing insights into problem difficulty and reducing barriers to application, though it is incremental in refining existing theoretical bounds.
The paper tackles the problem of deriving tighter, problem-dependent regret bounds in reinforcement learning without requiring prior domain knowledge, achieving state-of-the-art worst-case bounds and substantially improved bounds when the environment has small variance in value functions, with a leading term independent of horizon H.
Strong worst-case performance bounds for episodic reinforcement learning exist but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes a RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this we derive an algorithm for finite horizon discrete MDPs and associated analysis that both yields state-of-the art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the next-state value functions. An important benefit of our algorithmic is that it does not require apriori knowledge of a bound on the environmental norm. As a result of our analysis, we also help address an open learning theory question~\cite{jiang2018open} about episodic MDPs with a constant upper-bound on the sum of rewards, providing a regret bound with no $H$-dependence in the leading term that scales a polynomial function of the number of episodes.