LG AIDec 24, 2020

A Regret bound for Non-stationary Multi-Armed Bandits with Fairness Constraints

arXiv:2012.13380v12.31 citations

Originality Highly original

AI Analysis

This work provides the first fair algorithm with a sublinear regret bound for non-stationary multi-armed bandits, which is significant for researchers and practitioners dealing with dynamic environments where fairness is a critical concern.

This paper addresses the problem of non-stationary multi-armed bandits with a stringent fairness constraint, where a poorly performing candidate should not be preferred over a better one. The authors propose Fair-UCBe, an algorithm that satisfies this fairness condition and achieves a regret bound of O(k^(3/2) T^(1 - α/2) √(log T)) for a slowly varying stochastic k-armed bandit problem.

The multi-armed bandits' framework is the most common platform to study strategies for sequential decision-making problems. Recently, the notion of fairness has attracted a lot of attention in the machine learning community. One can impose the fairness condition that at any given point of time, even during the learning phase, a poorly performing candidate should not be preferred over a better candidate. This fairness constraint is known to be one of the most stringent and has been studied in the stochastic multi-armed bandits' framework in a stationary setting for which regret bounds have been established. The main aim of this paper is to study this problem in a non-stationary setting. We present a new algorithm called Fair Upper Confidence Bound with Exploration Fair-UCBe algorithm for solving a slowly varying stochastic $k$-armed bandit problem. With this we present two results: (i) Fair-UCBe indeed satisfies the above mentioned fairness condition, and (ii) it achieves a regret bound of $O\left(k^{\frac{3}{2}} T^{1 - \fracα{2}} \sqrt{\log T}\right)$, for some suitable $α\in (0, 1)$, where $T$ is the time horizon. This is the first fair algorithm with a sublinear regret bound applicable to non-stationary bandits to the best of our knowledge. We show that the performance of our algorithm in the non-stationary case approaches that of its stationary counterpart as the variation in the environment tends to zero.

View on arXiv PDF

Similar