Combinatorial Rising Bandit
This work addresses a gap in rising bandit models for applications like robotics and recommendation systems, but it is incremental as it extends existing frameworks to handle combinatorial dependencies.
The paper tackles the problem of combinatorial online learning with rising rewards, where playing a base arm enhances future rewards and affects multiple super arms, by introducing the Combinatorial Rising Bandit (CRB) framework and proposing the CRUCB algorithm with provable regret bounds and empirical validation in synthetic and deep reinforcement learning environments.
Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards. It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems. In many real-world scenarios, we often encounter rising rewards, where playing a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots enhancing proficiency through practice and social influence strengthening in the history of successful recommendations. Moreover, the enhancement of a single base arm may affect multiple super arms that include it, introducing complex dependencies that are not captured by existing rising bandit models. To address this, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB). We establish an upper bound on regret CRUCB and show that it is nearly tight by deriving a matching lower bound. In addition, we empirically demonstrate the effectiveness of CRUCB not only in synthetic environments but also in realistic applications of deep reinforcement learning.