Batched Stochastic Linear Bandits with 1-Bit Communication Constraints

Ivan Lau, Daniel McMorrow, Kevin Jamieson, Jonathan Scarlett

arXiv:2605.3097612.8h-index: 14

Predicted impact top 55% in ML · last 90 daysOriginality Highly original

AI Analysis

This work addresses the practical challenge of limited communication in online decision-making for systems where feedback is batched and severely constrained, such as in remote sensing or distributed learning.

This paper investigates batched stochastic linear bandits where the learner receives only a single bit of feedback per batch of $B$ arm pulls. The authors establish a minimax lower bound of $\\widetilde\\Omega(B\\min\\{d,\\log|\\mathcal{A}|\\} + \\sqrt{dT \\min\\{d,\\log|\\mathcal{A}|\\}})$ regret and propose two phased-elimination algorithms. One algorithm achieves $\\widetilde{O}(dB + d\\sqrt{T})$ regret, matching the lower bound for large action spaces, while the other achieves $\\widetilde{O}(B\\log|\\mathcal{A}| + d^{3/2}\\sqrt{B} + \\sqrt{dT\\log|\\mathcal{A}|})$ regret, which is near-optimal in various scaling regimes.

We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the corresponding $B$ rewards and responds with a single bit of feedback to the learner. For each batch, the learner specifies the 1-bit quantization rule the agent uses, which may depend on all previously received bits but not on any past rewards directly. This setting addresses a significant yet unexplored ``middle ground'' between previous models having per-round quantization only or total bit budgets only. We establish a minimax lower bound showing that $Ω(B\min\{d,\log\lvert \mathcal{A} \rvert\})$ regret is unavoidable due to the 1-bit communication bottleneck, even in the absence of noise. Combined with standard statistical limits, this yields a general lower bound of $\widetildeΩ(B\min\{d,\log\lvert \mathcal{A} \rvert\} + \sqrt{dT \min\{d,\log\lvert \mathcal{A} \rvert\}})$. We develop two phased-elimination algorithms based on $G$-optimal designs and 1-bit mean estimation. The first achieves $\widetilde{O}(dB + d\sqrt{T})$ regret, matching the lower bound up to logarithmic factors when $\lvert \mathcal{A} \rvert = \exp(Ω(d))$, and the second incorporates a safe-arm identification and warm-start procedure to obtain $\widetilde{O}(B\log\lvert \mathcal{A} \rvert + d^{3/2}\sqrt{B} + \sqrt{dT\log\lvert \mathcal{A} \rvert})$ regret, which is near-optimal in broad scaling regimes of $(\lvert \mathcal{A} \rvert, B, d, T)$. Together, our results demonstrate that a single bit of feedback per batch suffices to nearly match the minimax regret of unconstrained linear bandits in broad scaling regimes, even for batch sizes as large as $Θ(\sqrt{T})$.

View on arXiv PDF

Similar