LGDSGTMLFeb 10, 2020

Combinatorial Semi-Bandit in the Non-Stationary Environment

arXiv:2002.03580v225 citations
AI Analysis

This addresses the challenge of decision-making under uncertainty in dynamic environments for applications like online advertising or recommendation systems, with incremental improvements in algorithm design.

The paper tackles the non-stationary combinatorial semi-bandit problem, achieving nearly optimal regret bounds of $ ilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ in the switching case and $ ilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case for general scenarios, and develops a parameter-free algorithm for a linear reward special case.

In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only approximate offline oracle exists \cite{wang2017improving}, our algorithm achieves $\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ distribution-dependent regret in the switching case, and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case, where $\mathcal S$ is the number of switchings and $\mathcal V$ is the sum of the total ``distribution changes''. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter $\mathcal S$ or $\mathcal V$ in advance. We further show that by employing another technique, our algorithm no longer needs to know the parameters $\mathcal S$ or $\mathcal V$ but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes