Lower Bounds for Multi-armed Bandit with Non-equivalent Multiple Plays
This work addresses a specific variant of the multi-armed bandit problem, offering theoretical guarantees for regret optimization in sequential decision-making scenarios, but it appears incremental as it extends existing frameworks with ordering considerations.
The paper tackles the stochastic multi-armed bandit problem with non-equivalent multiple plays, where an agent selects and orders arms affecting rewards, and provides lower bounds for regret with O(log t) asymptotics and novel coefficients, along with optimal algorithms proving these bounds are tight.
We study the stochastic multi-armed bandit problem with non-equivalent multiple plays where, at each step, an agent chooses not only a set of arms, but also their order, which influences reward distribution. In several problem formulations with different assumptions, we provide lower bounds for regret with standard asymptotics $O(\log{t})$ but novel coefficients and provide optimal algorithms, thus proving that these bounds cannot be improved.