Model-Based Learning of Whittle indices
This work addresses the challenge of efficiently learning Whittle indices in MDPs, which is important for researchers and practitioners in reinforcement learning and resource allocation, but it is incremental as it builds on existing algorithms with extensions and empirical improvements.
The paper tackles the problem of learning Whittle indices for indexable, communicating, and unichain Markov Decision Processes (MDPs) by introducing BLINQ, a model-based algorithm that builds an empirical estimate of the MDP and computes indices using an extended state-of-the-art method. The result shows that BLINQ significantly outperforms Q-learning approaches in sample efficiency and computational cost, with numerical experiments indicating lower sample requirements and total cost even when Q-learning is enhanced with neural networks.
We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, it has a total computational cost even lower than Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are speeded up using pre-trained neural networks to predict Q-values.