Stochastic Low-Rank Bandits
This addresses a fundamental challenge in bandit problems for low-rank matrices, relevant to applications like computer vision and recommender systems, and is a novel contribution rather than incremental.
The paper tackles the problem of finding the maximum entry in a stochastic low-rank matrix from sequential noisy observations, proposing the LowRankElim algorithm with an O((K + L) poly(d) Δ^{-1} log n) regret bound, which is the first such result in the literature.
Many problems in computer vision and recommender systems involve low-rank matrices. In this work, we study the problem of finding the maximum entry of a stochastic low-rank matrix from sequential observations. At each step, a learning agent chooses pairs of row and column arms, and receives the noisy product of their latent values as a reward. The main challenge is that the latent values are unobserved. We identify a class of non-negative matrices whose maximum entry can be found statistically efficiently and propose an algorithm for finding them, which we call LowRankElim. We derive a $\DeclareMathOperator{\poly}{poly} O((K + L) \poly(d) Δ^{-1} \log n)$ upper bound on its $n$-step regret, where $K$ is the number of rows, $L$ is the number of columns, $d$ is the rank of the matrix, and $Δ$ is the minimum gap. The bound depends on other problem-specific constants that clearly do not depend $K L$. To the best of our knowledge, this is the first such result in the literature.