LG ML · Jan 23, 2019

Meta-Learning for Contextual Bandit Exploration

arXiv:1901.08159v1 · 12.520 citations

Originality: Incremental advance

AI Analysis

This work addresses efficient exploration in interactive decision-making, with applications such as recommendation systems. It is an incremental advance that builds on existing meta-learning and bandit methods.

The paper tackles the exploration-exploitation trade-off in contextual bandits by proposing MELEE, a meta-learning algorithm that learns an exploration policy on synthetic data and applies it to real tasks. Compared against seven baselines on 300 datasets, it outperforms the alternatives in most settings, particularly when reward differences are large.

We describe MELEE, a meta-learning algorithm for learning a good exploration policy in the interactive contextual bandit setting. Here, an algorithm must take actions based on contexts, and learn based only on a reward signal from the action taken, thereby generating an exploration/exploitation trade-off. MELEE addresses this trade-off by learning a good exploration strategy for offline tasks based on synthetic data, on which it can simulate the contextual bandit setting. Based on these simulations, MELEE uses an imitation learning strategy to learn a good exploration policy that can then be applied to true contextual bandit tasks at test time. We compare MELEE to seven strong baseline contextual bandit algorithms on a set of three hundred real-world datasets, on which it outperforms alternatives in most settings, especially when differences in rewards are large. Finally, we demonstrate the importance of having a rich feature representation for learning how to explore.
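
The setting described in the abstract can be sketched as a simulation: a fully labeled synthetic task is replayed one example at a time, and the learner sees only the reward of the action it chose. The sketch below is a minimal illustration under stated assumptions, not MELEE itself. A simple epsilon-greedy rule stands in for the exploration policy, and the names `make_synthetic_task`, `run_epsilon_greedy`, and `dot`, the linear per-action reward model, and the 0/1 reward are all hypothetical choices made for this example.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def make_synthetic_task(n_examples, n_features, n_actions, rng):
    """Hypothetical generator: returns (context, correct_action) pairs.

    A random linear scorer decides which action is correct, so the task
    is fully labeled and bandit feedback can be simulated from it.
    """
    scorer = [[rng.gauss(0, 1) for _ in range(n_features)]
              for _ in range(n_actions)]
    data = []
    for _ in range(n_examples):
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        scores = [dot(w, x) for w in scorer]
        data.append((x, scores.index(max(scores))))
    return data


def run_epsilon_greedy(data, n_actions, epsilon, rng, lr=0.05):
    """Simulate the contextual bandit loop; return the average reward.

    Epsilon-greedy is an assumption of this sketch, standing in for a
    learned exploration policy.
    """
    n_features = len(data[0][0])
    # One linear reward estimator per action, all starting at zero.
    model = [[0.0] * n_features for _ in range(n_actions)]
    total_reward = 0.0
    for x, correct in data:
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)      # explore
        else:
            preds = [dot(w, x) for w in model]
            action = preds.index(max(preds))       # exploit
        # Bandit feedback: only the chosen action's reward is revealed.
        reward = 1.0 if action == correct else 0.0
        total_reward += reward
        # Squared-error SGD step on the chosen action's estimator only.
        err = reward - dot(model[action], x)
        model[action] = [w + lr * err * xi
                         for w, xi in zip(model[action], x)]
    return total_reward / len(data)


if __name__ == "__main__":
    task = make_synthetic_task(500, 5, 3, random.Random(0))
    print(run_epsilon_greedy(task, 3, 0.1, random.Random(1)))
```

Because the synthetic task carries the correct action for every context, the same data can be replayed under different exploration rules and their average rewards compared, which is the kind of offline simulation the abstract relies on.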
