Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting
This work addresses the problem of efficient reinforcement learning in complex factored environments for researchers and practitioners, offering incremental improvements with tighter theoretical bounds and practical algorithms.
The paper tackles reinforcement learning in non-episodic factored Markov decision processes (FMDPs) by proposing two near-optimal, oracle-efficient algorithms with Bayesian and frequentist regret bounds, and introduces a tighter connectivity measure called factored span to improve lower bounds, showing performance gains in computer network simulations.
We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound respectively, both of which reduce to the near-optimal bound $\widetilde{O}(DS\sqrt{AT})$ for standard non-factored MDPs. We propose a tighter connectivity measure, factored span, for FMDPs and prove a lower bound that depends on the factored span rather than the diameter $D$. In order to decrease the gap between lower and upper bounds, we propose an adaptation of the REGAL.C algorithm whose regret bound depends on the factored span. Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.