LGSYAug 30, 2025

Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability

arXiv:2509.00415v11 citationsh-index: 5
Originality Incremental advance
AI Analysis

This work addresses a complex optimization problem with applications in recommendation systems and healthcare, but it appears incremental as it extends existing restless bandit frameworks to multi-action settings without claiming major breakthroughs.

The paper tackles the multi-action partially observable restless multi-armed bandit problem, a generalization with finite states and actions where states are unobservable, motivated by public-health intervention planning. It analyzes Lagrangian bounds using approximations like point-based value iteration and online rollout policies, and discusses heuristic policies and Whittle index limitations.

Partially observable restless multi-armed bandits have found numerous applications including in recommendation systems, communication systems, public healthcare outreach systems, and in operations research. We study multi-action partially observable restless multi-armed bandits, it is a generalization of the classical restless multi-armed bandit problem -- 1) each bandit has finite states, and the current state is not observable, 2) each bandit has finite actions. In particular, we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long term discounted optimization problem, where the state of each bandit evolves according to a Markov process, and this evolution is action dependent. The state of a bandit is not observable but one of finitely many feedback signals are observable. Each bandit yields a reward, based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent. However, they are weakly coupled at the agent through the budget constraint. We first analyze the Lagrangian bound method for our partially observable restless bandits. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial. Hence, the computation of Lagrangian bounds is also challenging. We describe approximations for the computation of Lagrangian bounds using point based value iteration (PBVI) and online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and online rollout policy. We study heuristic policies for multi-actions PORMAB. Finally, we discuss present Whittle index policies and their limitations in our model.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes