LGJul 7, 2021

Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes

Nihal Sharma, Rajat Sen, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai

arXiv:2107.03263v31.6

Originality Incremental advance

AI Analysis

This work addresses efficient decision-making in bandit problems with expert advice, offering theoretical guarantees for both static and dynamic environments, though it is incremental in building on existing bandit and expert frameworks.

The paper tackles the contextual bandit problem with stochastic expert policies, proposing algorithms that achieve constant regret bounds independent of the time horizon and scale linearly with the number of experts, and extend to episodic settings with regret scaling as O(E(N+1) + N√E/T²).

We study a variant of the contextual bandit problem where an agent can intervene through a set of stochastic expert policies. Given a fixed context, each expert samples actions from a fixed conditional distribution. The agent seeks to remain competitive with the 'best' among the given set of experts. We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts and provide horizon-independent constant regret bounds that only scale linearly in the number of experts. We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions. Further, we investigate the episodic setting where the agent interacts with an environment that changes over episodes. Each episode can have different context and reward distributions resulting in the best expert changing across episodes. We show that by bootstrapping from $\mathcal{O}\left(N\log\left(NT^2\sqrt{E}\right)\right)$ samples, ED-UCB guarantees a regret that scales as $\mathcal{O}\left(E(N+1) + \frac{N\sqrt{E}}{T^2}\right)$ for $N$ experts over $E$ episodes, each of length $T$. We finally empirically validate our findings through simulations.

View on arXiv PDF

Similar