LG MLSep 12, 2017

Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits

arXiv:1709.04004v27.730 citations

Originality Incremental advance

AI Analysis

This work addresses a domain-specific problem in bandit algorithms for scenarios with fluctuating costs, offering incremental improvements over existing methods.

The paper tackles the problem of opportunistic bandits, where regret varies with environmental conditions like network load, by proposing the AdaUCB algorithm to adaptively balance exploration and exploitation, achieving O(log T) regret with a smaller coefficient than traditional UCB and O(1) regret under zero exploration cost conditions.

In this paper, we propose and study opportunistic bandits - a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves $O(\log T)$ regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves $O(1)$ regret with respect to $T$ if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.

View on arXiv PDF

Similar