LG · Nov 12, 2025

MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

arXiv:2511.09324v1 · h-index: 2
Originality: Highly original
AI Analysis

This paper addresses the challenge of adapting decision-making systems, such as recommender systems, to dynamic environments. The contribution is incremental in that it builds on existing RMAB frameworks.

The paper tackles decision-making in nonstationary environments by introducing MARBLE, a model that augments Restless Multi-Armed Bandits (RMABs) with a latent Markov state. It shows that synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices under the Markov-Averaged Indexability (MAI) criterion, and validates the result on a recommender-system simulator.

Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.
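To make the setting in the abstract concrete, the following is a minimal sketch of synchronous Q-learning for Whittle indices on a single arm whose dynamics depend on a hidden, Markov-switching environment. It is not the paper's implementation. It assumes a discounted reward criterion and a common two-timescale form of the update (Q-values on a fast step size, one subsidy estimate per reference state on a slow step size). The function name `qwi_latent`, the step-size schedules, and all transition matrices and rewards are hypothetical values chosen for illustration.

```python
import numpy as np


def qwi_latent(P, R, M, gamma=0.9, steps=5000, seed=0):
    """Sketch of synchronous Q-learning with Whittle-index estimates.

    P: (n_env, 2, n_states, n_states), P[e, a, s] is the next-state
       distribution of the arm in latent environment e under action a.
    R: (n_states, 2), reward for state s and action a (0 passive, 1 active).
    M: (n_env, n_env), transition matrix of the latent environment.
    The learner never reads the latent state; it only affects sampling.
    """
    rng = np.random.default_rng(seed)
    n_env, _, n_states, _ = P.shape
    # Q[x, s, a]: Q-value at (s, a) when the subsidy is tuned for reference state x.
    Q = np.zeros((n_states, n_states, 2))
    lam = np.zeros(n_states)  # lam[x]: running Whittle-index estimate for state x
    env = 0                   # hidden regime (assumed initial value)

    for t in range(1, steps + 1):
        alpha = 1.0 / (1 + t) ** 0.6  # fast step size (assumed schedule)
        beta = 1.0 / (1 + t)          # slow step size, beta / alpha -> 0

        # Synchronous update: every (s, a) pair gets one sampled transition.
        for s in range(n_states):
            for a in (0, 1):
                s_next = rng.choice(n_states, p=P[env, a, s])
                # The passive action (a = 0) earns the subsidy lam[x].
                target = (R[s, a] + (1 - a) * lam
                          + gamma * Q[:, s_next, :].max(axis=1))
                Q[:, s, a] += alpha * (target - Q[:, s, a])

        # Move each subsidy toward indifference between active and passive at x.
        idx = np.arange(n_states)
        lam += beta * (Q[idx, idx, 1] - Q[idx, idx, 0])

        # The latent environment switches according to its own Markov chain.
        env = rng.choice(n_env, p=M[env])

    return Q, lam


# Hypothetical two-state arm with two latent regimes.
P = np.array([
    [[[0.9, 0.1], [0.4, 0.6]],   # regime 0, passive
     [[0.3, 0.7], [0.1, 0.9]]],  # regime 0, active
    [[[0.6, 0.4], [0.8, 0.2]],   # regime 1, passive
     [[0.5, 0.5], [0.3, 0.7]]],  # regime 1, active
])
R = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward 1 whenever the arm is in state 1
M = np.array([[0.95, 0.05], [0.10, 0.90]])

Q, lam = qwi_latent(P, R, M)
print("estimated Whittle indices per state:", lam)
```

With several arms, an index policy would activate, at each step, the arms whose current states have the largest estimated indices. Because the update never conditions on the regime, the estimates here can only reflect dynamics averaged over the latent chain, which is the kind of averaging the MAI criterion is named for.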
