OCMar 9, 2013
Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit ProblemsSattar Vakili, Keqin Liu, Qing Zhao
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. An approach based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is developed for constructing sequential arm selection policies. It is shown that for all light-tailed reward distributions, DSEE achieves the optimal logarithmic order of the regret, where regret is defined as the total expected reward loss against the ideal case with known reward models. For heavy-tailed reward distributions, DSEE achieves O(T^1/p) regret when the moments of the reward distributions exist up to the pth order for 1<p<=2 and O(T^1/(1+p/2)) for p>2. With the knowledge of an upperbound on a finite moment of the heavy-tailed reward distributions, DSEE offers the optimal logarithmic regret order. The proposed DSEE approach complements existing work on MAB by providing corresponding results for general reward distributions. Furthermore, with a clearly defined tunable parameter-the cardinality of the exploration sequence, the DSEE approach is easily extendable to variations of MAB, including MAB with various objectives, decentralized MAB with multiple players and incomplete reward observations under collisions, MAB with unknown Markov dynamics, and combinatorial MAB with dependent arms that often arise in network optimization problems such as the shortest path, the minimum spanning, and the dominating set problems under unknown random weights.
SYDec 1, 2011
Dynamic Intrusion Detection in Resource-Constrained Cyber NetworksKeqin Liu, Qing Zhao
We consider a large-scale cyber network with N components (e.g., paths, servers, subnets). Each component is either in a healthy state (0) or an abnormal state (1). Due to random intrusions, the state of each component transits from 0 to 1 over time according to certain stochastic process. At each time, a subset of K (K < N) components are checked and those observed in abnormal states are fixed. The objective is to design the optimal scheduling for intrusion detection such that the long-term network cost incurred by all abnormal components is minimized. We formulate the problem as a special class of Restless Multi-Armed Bandit (RMAB) process. A general RMAB suffers from the curse of dimensionality (PSPACE-hard) and numerical methods are often inapplicable. We show that, for this class of RMAB, Whittle index exists and can be obtained in closed form, leading to a low-complexity implementation of Whittle index policy with a strong performance. For homogeneous components, Whittle index policy is shown to have a simple structure that does not require any prior knowledge on the intrusion processes. Based on this structure, Whittle index policy is further shown to be optimal over a finite time horizon with an arbitrary length. Beyond intrusion detection, these results also find applications in queuing networks with finite-size buffers.
OCFeb 15, 2011
Decentralized Restless Bandit with Multiple Players and Unknown DynamicsHaoyang Liu, Keqin Liu, Qing Zhao
We consider decentralized restless multi-armed bandit problems with unknown dynamics and multiple players. The reward state of each arm transits according to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random process when it is passive. Players activating the same arm at the same time collide and suffer from reward loss. The objective is to maximize the long-term reward by designing a decentralized arm selection policy to address unknown reward models and collisions among players. A decentralized policy is constructed that achieves a regret with logarithmic order when an arbitrary nontrivial bound on certain system parameters is known. When no knowledge about the system is available, we extend the policy to achieve a regret arbitrarily close to the logarithmic order. The result finds applications in communication networks, financial investment, and industrial engineering.
MLJul 6, 2023
General Formulation and PCL-Analysis for Restless Bandits with Limited ObservabilityKeqin Liu, Qizhen Jia
In this paper, we consider a general observation model for restless multi-armed bandit problems. The operation of the player is based on the past observation history that is limited (partial) and error-prone due to resource constraints or environmental or intrinsic noises. By establishing a general probabilistic model for dynamics of the observation process, we formulate the problem as a restless bandit with an infinite high-dimensional belief state space. We apply the achievable region method with partial conservation law (PCL) to the infinite-state problem and analyze its indexability and priority index (Whittle index). Finally, we propose an approximation process to transform the problem into which the AG algorithm of Niño-Mora (2001) for finite-state problems can be applied. Numerical experiments show that our algorithm has excellent performance.
LGAug 9, 2021
Low-Complexity Algorithm for Restless Bandits with Imperfect ObservationsKeqin Liu, Richard Weber, Chengzhong Zhang
We consider a class of restless bandit problems that finds a broad application area in reinforcement learning and stochastic optimization. We consider $N$ independent discrete-time Markov processes, each of which had two possible states: 1 and 0 (`good' and `bad'). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon subject to a constraint that only $M$ $(<N)$ processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time $t$, a probability that process $i$ is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-HARD in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves a strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parametric space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.
NIJan 24, 2012
Adaptive Shortest-Path Routing under Unknown and Stochastically Varying Link StatesKeqin Liu, Qing Zhao
We consider the adaptive shortest-path routing problem in wireless networks under unknown and stochastically varying link states. In this problem, we aim to optimize the quality of communication between a source and a destination through adaptive path selection. Due to the randomness and uncertainties in the network dynamics, the quality of each link varies over time according to a stochastic process with unknown distributions. After a path is selected for communication, the aggregated quality of all links on this path (e.g., total path delay) is observed. The quality of each individual link is not observable. We formulate this problem as a multi-armed bandit with dependent arms. We show that by exploiting arm dependencies, a regret polynomial with network size can be achieved while maintaining the optimal logarithmic order with time. This is in sharp contrast with the exponential regret order with network size offered by a direct application of the classic MAB policies that ignore arm dependencies. Furthermore, our results are obtained under a general model of link-quality distributions (including heavy-tailed distributions) and find applications in cognitive radio and ad hoc networks with unknown and dynamic communication environments.