LGDSMLSep 24, 2022

Non-monotonic Resource Utilization in the Bandits with Knapsacks Problem

arXiv:2209.12013v115 citationsh-index: 62
Originality Incremental advance
AI Analysis

This work addresses sequential decision-making under resource constraints for applications like online advertising or resource management, representing an incremental extension of the existing BwK model.

The paper tackles the bandits with knapsacks problem by generalizing it to allow non-monotonic resource utilization, where resource drifts can be positive or negative, and develops a policy with constant regret when distributions are known and a learning algorithm with logarithmic regret when they are unknown.

Bandits with knapsacks (BwK) is an influential model of sequential decision-making under uncertainty that incorporates resource consumption constraints. In each round, the decision-maker observes an outcome consisting of a reward and a vector of nonnegative resource consumptions, and the budget of each resource is decremented by its consumption. In this paper we introduce a natural generalization of the stochastic BwK problem that allows non-monotonic resource utilization. In each round, the decision-maker observes an outcome consisting of a reward and a vector of resource drifts that can be positive, negative or zero, and the budget of each resource is incremented by its drift. Our main result is a Markov decision process (MDP) policy that has constant regret against a linear programming (LP) relaxation when the decision-maker knows the true outcome distributions. We build upon this to develop a learning algorithm that has logarithmic regret against the same LP relaxation when the decision-maker does not know the true outcome distributions. We also present a reduction from BwK to our model that shows our regret bound matches existing results.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes