Sample Complexity of an Adversarial Attack on UCB-based Best-arm Identification Policy
This addresses security vulnerabilities in bandit algorithms for decision-making systems, but it is incremental as it builds directly on prior attack models and algorithms.
The paper tackles the problem of adversarial perturbations to rewards in a multi-armed bandit setting, specifically deriving the sample complexity required for an attack to make a target arm be selected as the best arm by a UCB-based policy, with the result showing that the stopping condition can be achieved in T rounds depending on the number of arms and reward distribution parameters.
In this work I study the problem of adversarial perturbations to rewards, in a Multi-armed bandit (MAB) setting. Specifically, I focus on an adversarial attack to a UCB type best-arm identification policy applied to a stochastic MAB. The UCB attack presented in [1] results in pulling a target arm K very often. I used the attack model of [1] to derive the sample complexity required for selecting target arm K as the best arm. I have proved that the stopping condition of UCB based best-arm identification algorithm given in [2], can be achieved by the target arm K in T rounds, where T depends only on the total number of arms and $σ$ parameter of $σ^2-$ sub-Gaussian random rewards of the arms.