LGFeb 19, 2025

Continuous K-Max Bandits

Yu Chen, Siwei Wang, Longbo Huang, Wei Chen

arXiv:2502.13467v19.41 citationsh-index: 5

Originality Incremental advance

AI Analysis

This addresses a critical problem in applications like recommendation systems and distributed computing, offering incremental improvements by extending bandit theory to continuous settings with specific feedback structures.

The paper tackles the K-Max combinatorial multi-armed bandits problem with continuous outcome distributions and weak value-index feedback, proposing algorithms DCK-UCB and MLE-Exp that achieve sublinear regret bounds of O~(T^{3/4}) for general distributions and O~(sqrt(T)) for exponential distributions, respectively.

We study the $K$-Max combinatorial multi-armed bandits problem with continuous outcome distributions and weak value-index feedback: each base arm has an unknown continuous outcome distribution, and in each round the learning agent selects $K$ arms, obtains the maximum value sampled from these $K$ arms as reward and observes this reward together with the corresponding arm index as feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. The continuous $K$-Max bandits introduce unique challenges, including discretization error from continuous-to-discrete conversion, non-deterministic tie-breaking under limited feedback, and biased estimation due to partial observability. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds to tackle these challenges. For general continuous distributions, we prove that DCK-UCB achieves a $\widetilde{\mathcal{O}}(T^{3/4})$ regret upper bound, establishing the first sublinear regret guarantee for this setting. Furthermore, we identify an important special case with exponential distributions under full-bandit feedback. In this case, our proposed algorithm MLE-Exp enables $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound through maximal log-likelihood estimation, achieving near-minimax optimality.

View on arXiv PDF

Similar