AIFeb 20, 2013

Probabilistic Exploration in Planning while Learning

arXiv:1302.4966v16 citations
Originality Incremental advance
AI Analysis

This addresses the scalability issue in exploration strategies for reinforcement learning, offering a more efficient method for tasks with large or infinite spaces, though it appears incremental as it builds on existing Q-learning frameworks.

The paper tackles the exploration-exploitation trade-off in reinforcement learning, particularly for Q-learning in large state-action spaces, by developing a probabilistic hill-climbing algorithm that uses statistical selection to ensure plans are arbitrarily close to locally optimal ones with high probability, and an experiment on a complex control task shows it outperforms a typical strategy.

Sequential decision tasks with incomplete information are characterized by the exploration problem; namely the trade-off between further exploration for learning more about the environment and immediate exploitation of the accrued information for decision-making. Within artificial intelligence, there has been an increasing interest in studying planning-while-learning algorithms for these decision tasks. In this paper we focus on the exploration problem in reinforcement learning and Q-learning in particular. The existing exploration strategies for Q-learning are of a heuristic nature and they exhibit limited scaleability in tasks with large (or infinite) state and action spaces. Efficient experimentation is needed for resolving uncertainties when possible plans are compared (i.e. exploration). The experimentation should be sufficient for selecting with statistical significance a locally optimal plan (i.e. exploitation). For this purpose, we develop a probabilistic hill-climbing algorithm that uses a statistical selection procedure to decide how much exploration is needed for selecting a plan which is, with arbitrarily high probability, arbitrarily close to a locally optimal one. Due to its generality the algorithm can be employed for the exploration strategy of robust Q-learning. An experiment on a relatively complex control task shows that the proposed exploration strategy performs better than a typical exploration strategy.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes