MLLGNov 22, 2023

Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks

arXiv:2311.13180v23 citationsh-index: 48
Originality Incremental advance
AI Analysis

This addresses the challenge of efficient learning in high-dimensional bandit problems with batched feedback, which is incremental by building on existing bandit methods but adapts them to batch constraints.

The paper tackles the problem of high-dimensional contextual bandits with batched feedback, where online interactions are divided into batches and rewards are revealed only at batch ends, common in applications like personalized medicine and online advertising. It designs a provably sample-efficient algorithm that achieves regret bounds of $\mathcal{ ilde O}(s_0^2 \log^2 T)$ for sparse cases and $\mathcal{ ilde O}(r^2 \log^2 T)$ for low-rank cases using only $\mathcal{O}(\log T)$ batches, comparable to fully sequential settings.

We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Such a feedback structure is popular in applications such as personalized medicine and online advertisement, where the online data often do not arrive in a fully serial manner. We consider high-dimensional and linear settings where the reward function of the bandit model admits either a sparse or low-rank structure and ask how small a number of batches are needed for a comparable performance with fully dynamic data in which $L = T$. For these settings, we design a provably sample-efficient algorithm which achieves a $ \mathcal{\tilde O}(s_0^2 \log^2 T)$ regret in the sparse case and $ \mathcal{\tilde O} ( r ^2 \log^2 T)$ regret in the low-rank case, using only $L = \mathcal{O}( \log T)$ batches. Here $s_0$ and $r$ are the sparsity and rank of the reward parameter in sparse and low-rank cases, respectively, and $ \mathcal{\tilde O}(\cdot)$ omits logarithmic factors involving the feature dimensions. In other words, our algorithm achieves regret bounds comparable to those in fully sequential setting with only $\mathcal{O}( \log T)$ batches. Our algorithm features a novel batch allocation method that adjusts the batch sizes according to the estimation accuracy within each batch and cumulative regret. Furthermore, we also conduct experiments with synthetic and real-world data to validate our theory.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes