LG ML Apr 1, 2022

Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk

Tianrui Chen, Aditya Gangrade, Venkatesh Saligrama

arXiv:2204.00706v115.118 citations h-index: 46

Originality: Incremental advance

AI Analysis

The paper addresses safety-critical applications such as clinical trials by enforcing safety in every round rather than in aggregate. The advance is incremental, since it adds per-round safety constraints to existing bandit methods.

The paper tackles the multi-armed bandit problem under safety constraints: each arm has an unknown reward and an unknown risk, and the goal is to maximize reward while avoiding arms whose mean risk exceeds a given threshold. It proposes doubly optimistic strategies that achieve tight gap-dependent logarithmic regret bounds and play unsafe arms only logarithmically many times, with simulation studies validating their effectiveness.

We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk. We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation, regardless of the gain in reward due to the same. This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense. We describe doubly optimistic strategies for this scenario, which maintain optimistic indices for both safety risk and reward. We show that schema based on both frequentist and Bayesian indices satisfy tight gap-dependent logarithmic regret bounds, and further that these play unsafe arms only logarithmically many times in total. This theoretical analysis is complemented by simulation studies demonstrating the effectiveness of the proposed schema, and probing the domains in which their use is appropriate.
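To make the idea of a doubly optimistic index concrete, here is a minimal Python sketch. It is an illustration built on assumptions, not the authors' algorithm. It assumes Bernoulli rewards and risks drawn independently, a UCB-style confidence width of sqrt(2 log t / n), and a fallback rule for the case where no arm looks safe. The function name `doubly_optimistic_bandit` and all of its parameters are hypothetical. In each round the sketch keeps every arm whose lower confidence bound on mean risk is at most the threshold (optimism about safety) and, among those, plays the arm with the largest upper confidence bound on mean reward (optimism about reward).

```python
import math
import random


def doubly_optimistic_bandit(reward_means, risk_means, threshold, horizon, seed=0):
    """Simulate a doubly optimistic index policy on Bernoulli arms.

    Assumptions (not taken from the paper): Bernoulli rewards and risks,
    drawn independently, and a confidence width of sqrt(2 * log(t) / n).
    Returns (pull_counts, unsafe_plays).
    """
    rng = random.Random(seed)
    k = len(reward_means)
    pulls = [0] * k
    reward_sum = [0.0] * k
    risk_sum = [0.0] * k
    unsafe_plays = 0

    for t in range(1, horizon + 1):
        if t <= k:
            # Play each arm once so every empirical mean is defined.
            arm = t - 1
        else:
            width = [math.sqrt(2.0 * math.log(t) / pulls[a]) for a in range(k)]
            # Optimistic reward index: upper confidence bound on mean reward.
            reward_ucb = [reward_sum[a] / pulls[a] + width[a] for a in range(k)]
            # Optimistic safety index: lower confidence bound on mean risk.
            risk_lcb = [risk_sum[a] / pulls[a] - width[a] for a in range(k)]
            plausibly_safe = [a for a in range(k) if risk_lcb[a] <= threshold]
            if plausibly_safe:
                arm = max(plausibly_safe, key=lambda a: reward_ucb[a])
            else:
                # Hypothetical fallback: play the arm that looks least risky.
                arm = min(range(k), key=lambda a: risk_lcb[a])

        # Observe a reward and a risk sample from the chosen arm.
        reward = 1.0 if rng.random() < reward_means[arm] else 0.0
        risk = 1.0 if rng.random() < risk_means[arm] else 0.0
        pulls[arm] += 1
        reward_sum[arm] += reward
        risk_sum[arm] += risk

        # Count plays of arms whose true mean risk exceeds the threshold.
        if risk_means[arm] > threshold:
            unsafe_plays += 1

    return pulls, unsafe_plays


if __name__ == "__main__":
    # Arm 0 has the highest reward but is unsafe at threshold 0.5.
    pulls, unsafe = doubly_optimistic_bandit(
        reward_means=[0.9, 0.6, 0.3],
        risk_means=[0.8, 0.2, 0.1],
        threshold=0.5,
        horizon=5000,
    )
    print("pulls per arm:", pulls)
    print("unsafe plays:", unsafe)
```

In this toy instance the unsafe arm looks attractive early on, because its reward index is the largest, and it is dropped from the plausibly safe set once its risk index rises above the threshold. That is the behaviour the abstract describes qualitatively; the sketch does not reproduce the paper's regret definition or its Bayesian variants.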
