Near-optimal Reinforcement Learning in Factored MDPs
This addresses the curse of dimensionality in reinforcement learning for practical applications with large state spaces, offering a significant improvement over general MDP methods.
The paper tackles the problem of reinforcement learning in factored MDPs, where traditional methods suffer from exponential regret due to large state and action spaces, and shows that near-optimal regret scaling polynomially with the number of parameters can be achieved using algorithms like PSRL and UCRL-Factored.
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $Ω(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = Ω(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a \emph{factored} MDP, it is possible to achieve regret that scales polynomially in the number of \emph{parameters} encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).