LG ML · Nov 5, 2018

Multi-armed Bandits with Compensation

arXiv:1811.01715v19.132 citations

Originality: Incremental advance

AI Analysis

The paper addresses incentive design in multi-armed bandit systems, with applications such as online platforms. The advance is incremental because it builds on existing bandit frameworks with compensation.

The paper tackles the known-compensation multi-armed bandit problem, in which the controller pays short-term players to explore arms, and achieves O(log T) regret and O(log T) compensation, both matching a theoretical lower bound.

We propose and study the known-compensation multi-armed bandit (KCMAB) problem, where a system controller offers a set of arms to many short-term players for $T$ steps. In each step, one short-term player arrives at the system. Upon arrival, the player aims to select the arm with the current best average reward and receives a stochastic reward associated with that arm. To incentivize players to explore other arms, the controller provides appropriate payment compensation to players. The objective of the controller is to maximize the total reward collected by players while minimizing the compensation. We first provide a compensation lower bound $\Theta\left(\sum_i \frac{\Delta_i \log T}{KL_i}\right)$, where $\Delta_i$ and $KL_i$ are the expected reward gap and the Kullback-Leibler (KL) divergence between the distributions of arm $i$ and the best arm, respectively. We then analyze three algorithms to solve the KCMAB problem, and obtain their regrets and compensations. We show that the algorithms all achieve $O(\log T)$ regret and $O(\log T)$ compensation that match the theoretical lower bound. Finally, we present experimental results to demonstrate the performance of the algorithms.
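To make the setting concrete, here is a minimal Python sketch of the interaction described in the abstract. It rests on several assumptions that the abstract does not state: rewards are Bernoulli, the controller uses a UCB1-style index (a stand-in, not necessarily one of the three algorithms the paper analyzes), the payment in a step equals the gap between the best empirical mean and the empirical mean of the arm the controller wants pulled, and the initialization pulls carry no payment. The names `kl_bernoulli`, `lower_bound_scale`, and `simulate` are hypothetical helpers introduced for illustration only.

```python
import math
import random


def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lower_bound_scale(means, horizon):
    """Evaluate sum_i Delta_i * log(T) / KL_i over the suboptimal arms.

    Assumes Bernoulli arms, with KL_i taken from arm i to the best arm.
    """
    best = max(means)
    return sum(
        (best - m) * math.log(horizon) / kl_bernoulli(m, best)
        for m in means
        if m < best
    )


def simulate(means, horizon, seed=0):
    """Run a UCB1-style controller and return (regret, compensation)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total observed reward per arm
    best = max(means)
    regret = 0.0
    compensation = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            # Pull each arm once so every empirical mean is defined.
            # Assumption: these initialization pulls are not compensated.
            arm = t - 1
        else:
            emp = [sums[i] / counts[i] for i in range(k)]
            ucb = [emp[i] + math.sqrt(2 * math.log(t) / counts[i])
                   for i in range(k)]
            arm = max(range(k), key=lambda i: ucb[i])
            # Assumed payment rule: the player would pick the arm with the
            # best empirical mean, so the controller covers the gap to the
            # arm it wants explored (zero when the two coincide).
            compensation += max(emp) - emp[arm]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # expected (pseudo-)regret of this pull
    return regret, compensation


if __name__ == "__main__":
    arm_means = [0.9, 0.8, 0.5]  # hypothetical instance
    for horizon in (1_000, 10_000):
        reg, comp = simulate(arm_means, horizon)
        print(horizon, round(reg, 1), round(comp, 1),
              round(lower_bound_scale(arm_means, horizon), 1))
```

Running the sketch at several horizons lets a reader compare how the accumulated regret and compensation grow next to the $\sum_i \Delta_i \log T / KL_i$ quantity. It illustrates the setup only and does not reproduce the paper's experiments.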
