LGSPFeb 10, 2023

Piecewise-Stationary Multi-Objective Multi-Armed Bandit with Application to Joint Communications and Sensing

arXiv:2302.05257v24 citationsh-index: 19
Originality Incremental advance
AI Analysis

This work addresses decision-making in dynamic multi-objective settings, with applications like joint communications and sensing, representing an incremental advancement in bandit algorithms.

The paper tackles a multi-objective multi-armed bandit problem in a dynamic environment with piecewise-stationary reward distributions, proposing a Pareto UCB-based algorithm with change detection that achieves regret bounds of order γ_T log(T/γ_T) when breakpoints are known and γ_T log(T) otherwise, as validated by numerical experiments on synthetic and real-world datasets.

We study a multi-objective multi-armed bandit problem in a dynamic environment. The problem portrays a decision-maker that sequentially selects an arm from a given set. If selected, each action produces a reward vector, where every element follows a piecewise-stationary Bernoulli distribution. The agent aims at choosing an arm among the Pareto optimal set of arms to minimize its regret. We propose a Pareto generic upper confidence bound (UCB)-based algorithm with change detection to solve this problem. By developing the essential inequalities for multi-dimensional spaces, we establish that our proposal guarantees a regret bound in the order of $γ_T\log(T/{γ_T})$ when the number of breakpoints $γ_T$ is known. Without this assumption, the regret bound of our algorithm is $γ_T\log(T)$. Finally, we formulate an energy-efficient waveform design problem in an integrated communication and sensing system as a toy example. Numerical experiments on the toy example and synthetic and real-world datasets demonstrate the efficiency of our policy compared to the current methods.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes