Cost-Ordered Feasibility for Multi-Armed Bandits with Cost Subsidy

Ishank Juneja, Carlee Joe-Wong, Osman Yağan

arXiv:2605.0717127.8

AI Analysis

For practitioners in recommendation systems and resource allocation, this work provides a more efficient algorithm for cost-sensitive bandit problems with reward constraints.

This paper addresses the multi-armed bandit with cost subsidy (MAB-CS) problem, where the goal is to minimize cost subject to a reward constraint relative to the unknown best arm. The proposed Cost-Ordered Feasibility (COF) algorithm achieves improved theoretical regret bounds and demonstrates superior empirical performance on MovieLens and Goodreads datasets compared to baselines.

The classic multi-armed bandit (MAB) problem tackles the challenge of accruing maximum reward while making decisions under uncertainty. However, in applications, often the goal is to minimize cost subject to a constraint on the minimum permissible reward, an objective captured by multi-armed bandits with cost-subsidy (MAB-CS). Of interest to this paper is the setting where the quality (reward) constraint is specified relative to the unknown best reward and the cost of each arm is known. We characterize the expected sub-optimal samples required by any policy by proving instance-dependent lower bounds that offer new insight into the problem and are a strict generalization of prior bounds. Then, we propose an algorithm called Cost-Ordered Feasibility (COF) that leverages our insight and intelligently combine samples from all arms to gauge the feasibility of a cheap arm. Thereafter, we analyze COF to establish instance-dependent upper bounds on its expected cumulative cost and quality regret, i.e., relative to the cheapest feasible arm. Finally, we empirically validate the merits of COF, comparing it to baselines from the literature through extensive simulation experiments on the MovieLens and Goodreads datasets as well as representative synthetic instances. Not only does our paper develop qualitatively better theoretical regret upper bounds, but COF also convincingly demonstrates improved empirical performance.

View on arXiv PDF

Similar