LG MLMay 8

Conformal-Style Quantile Analyses for Stochastic Bandits

arXiv:2605.0711523.8

AI Analysis

For researchers in stochastic bandits, this work addresses the mismatch between mean-based analysis and applications requiring strong upper-tail performance, offering a principled approach with theoretical guarantees.

The paper introduces ACP-UCB1, a conformal-style bandit algorithm that targets upper-tail performance rather than mean reward, achieving logarithmic upper-quantile regret with per-arm contribution O(log n / Δ_j^{ACP}).

Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(α\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1}(1-α/2)\) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution \(O(\nicefrac{\log n}{Δ_j^{\mathrm{ACP}}})\). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.

View on arXiv PDF

Similar