LG MLJan 8, 2019

Bilinear Bandits with Low-rank Structure

Kwang-Sung Jun, Rebecca Willett, Stephen Wright, Robert Nowak

arXiv:1901.02470v220.274 citations

Originality Incremental advance

AI Analysis

This addresses the exploration-exploitation tradeoff in bandit problems with structured rewards for applications like recommendation systems, though it is incremental as it builds on existing linear bandit methods.

The paper tackles the bilinear bandit problem with low-rank structure, where rewards depend on a low-rank matrix, by proposing a two-stage algorithm (ESTR) that achieves a regret bound of $\widetilde{\mathcal{O}}((d_1+d_2)^{3/2} \sqrt{r T})$, improving upon the naive linear bandit reduction's $\widetilde{\mathcal{O}}(d_1d_2\sqrt{T})$.

We introduce the bilinear bandit problem with low-rank structure in which an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. The unknown in the problem is a $d_1$ by $d_2$ matrix $\mathbfΘ^*$ that defines the reward, and has low rank $r \ll \min\{d_1,d_2\}$. Determination of $\mathbfΘ^*$ with this low-rank structure poses a significant challenge in finding the right exploration-exploitation tradeoff. In this work, we propose a new two-stage algorithm called "Explore-Subspace-Then-Refine" (ESTR). The first stage is an explicit subspace exploration, while the second stage is a linear bandit algorithm called "almost-low-dimensional OFUL" (LowOFUL) that exploits and further refines the estimated subspace via a regularization technique. We show that the regret of ESTR is $\widetilde{\mathcal{O}}((d_1+d_2)^{3/2} \sqrt{r T})$ where $\widetilde{\mathcal{O}}$ hides logarithmic factors and $T$ is the time horizon, which improves upon the regret of $\widetilde{\mathcal{O}}(d_1d_2\sqrt{T})$ attained for a naïve linear bandit reduction. We conjecture that the regret bound of ESTR is unimprovable up to polylogarithmic factors, and our preliminary experiment shows that ESTR outperforms a naïve linear bandit reduction.

View on arXiv PDF

Similar