MLAILGOCJan 30, 2017

Reinforcement Learning Algorithm Selection

arXiv:1701.08810v317 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of algorithm selection in RL for practitioners, but it is incremental as it builds on existing bandit and meta-learning approaches.

The paper tackles the problem of online algorithm selection for reinforcement learning by introducing ESBAS, a meta-algorithm that selects among off-policy RL algorithms to maximize expected return, and shows it outperforms individual algorithms on a dialogue task and adapts efficiently in online settings like Atari games with wide performance improvements.

This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning. The setup is as follows: given an episodic task and a finite number of off-policy RL algorithms, a meta-algorithm has to decide which RL algorithm is in control during the next episode so as to maximize the expected return. The article presents a novel meta-algorithm, called Epochal Stochastic Bandit Algorithm Selection (ESBAS). Its principle is to freeze the policy updates at each epoch, and to leave a rebooted stochastic bandit in charge of the algorithm selection. Under some assumptions, a thorough theoretical analysis demonstrates its near-optimality considering the structural sampling budget limitations. ESBAS is first empirically evaluated on a dialogue task where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting where algorithms update their policies after each transition, which we call SSBAS. SSBAS is evaluated on a fruit collection task where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves the performance by a wide margin.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes