Improved Algorithms for Nash Welfare in Linear Bandits
This work targets fairness-aware performance in linear bandits and offers improved algorithms for researchers and practitioners working on multi-armed bandit applications. The contribution is incremental in that it builds on the existing notion of Nash regret.
The paper addresses the suboptimal dependence on the ambient dimension in existing Nash regret bounds for linear bandits, introducing new analytical tools that yield an order-optimal bound. It then extends the analysis to $p$-means regret, a unifying framework that generalizes Nash regret. Experiments on linear bandit instances generated from real-world datasets show that the proposed methods consistently outperform the existing state-of-the-art baseline.
Nash regret has recently emerged as a principled fairness-aware performance metric for stochastic multi-armed bandits, motivated by the Nash Social Welfare objective. Although this notion has been extended to linear bandits, existing results suffer from suboptimality in ambient dimension $d$, stemming from proof techniques that rely on restrictive concentration inequalities. In this work, we resolve this open problem by introducing new analytical tools that yield an order-optimal Nash regret bound in linear bandits. Beyond Nash regret, we initiate the study of $p$-means regret in linear bandits, a unifying framework that interpolates between fairness and utility objectives and strictly generalizes Nash regret. We propose a generic algorithmic framework, FairLinBandit, that works as a meta-algorithm on top of any linear bandit strategy. We instantiate this framework using two bandit algorithms: Phased Elimination and Upper Confidence Bound, and prove that both achieve sublinear $p$-means regret for the entire range of $p$. Extensive experiments on linear bandit instances generated from real-world datasets demonstrate that our methods consistently outperform the existing state-of-the-art baseline.
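To make the interpolation between fairness and utility concrete, the following is a minimal sketch of how a $p$-means regret could be computed from a sequence of per-round expected rewards. It assumes the standard power-mean definition, in which $p = 1$ gives the arithmetic mean (average regret), $p = 0$ gives the geometric mean (Nash regret), and more negative $p$ weights the worst rounds more heavily. The abstract does not state the exact definition, so the function names `p_mean` and `p_means_regret` and the requirement that rewards be strictly positive are assumptions made for illustration, not the paper's code.

```python
import numpy as np


def p_mean(values, p):
    """Power mean of strictly positive values.

    p = 1 is the arithmetic mean, p = 0 the geometric mean (the limit
    of the power mean as p -> 0), and p = -1 the harmonic mean.
    """
    v = np.asarray(values, dtype=float)
    if v.size == 0 or np.any(v <= 0):
        # Assumption of this sketch: rewards are strictly positive,
        # so that logarithms and negative powers are well defined.
        raise ValueError("values must be non-empty and strictly positive")
    if p == 0:
        return float(np.exp(np.mean(np.log(v))))
    return float(np.mean(v ** p) ** (1.0 / p))


def p_means_regret(expected_rewards, best_reward, p):
    """Gap between the best arm's expected reward and the p-mean of the
    expected rewards actually collected over the rounds (hypothetical
    helper; the paper's formal definition may differ in details)."""
    return best_reward - p_mean(expected_rewards, p)
```

For example, calling `p_means_regret([0.2, 0.8, 0.8, 0.8], 0.8, p)` with `p = 1`, `p = 0`, and `p = -1` in turn gives an increasing regret, because a single poor round is penalized more as $p$ decreases. This is the sense in which smaller $p$ corresponds to a stronger fairness requirement.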