LGFeb 18, 2025

Contextual Linear Bandits with Delay as Payoff

Mengxiao Zhang, Yingfei Wang, Haipeng Luo

arXiv:2502.12528v213.04 citationsh-index: 11ICML

Originality Incremental advance

AI Analysis

This work addresses a limitation in bandit models for real-world applications like recommendation systems, though it is incremental by extending a prior delay-as-payoff model to contextual settings.

The paper tackles the problem of contextual linear bandits with payoff-dependent delays, where delays are proportional to payoffs, by proposing an efficient algorithm that achieves a regret overhead of at most $D\Delta_{\max}\log T$ compared to the no-delay case, with further improvements for loss payoffs.

A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most $DΔ_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $Δ_{\max}$ is the maximum suboptimality gap. When payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.

View on arXiv PDF

Similar