Variance-Dependent Regret Lower Bounds for Contextual Bandits
This work addresses a theoretical gap in contextual bandit algorithms for researchers, offering foundational lower bounds that are incremental but crucial for understanding algorithm limits.
The paper tackles the problem of establishing variance-dependent regret lower bounds for linear contextual bandits, overcoming limitations of prior work by providing bounds that match existing upper bounds up to logarithmic factors in both prefixed and adaptive variance sequence settings.
Variance-dependent regret bounds for linear contextual bandits, which improve upon the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}(d\sqrt{\sum_{k=1}^Kσ_k^2})$, where $d$ is the context dimension, $K$ is the number of rounds, and $σ^2_k$ is the noise variance in round $k$, has been widely studied in recent years. However, most existing works focus on the regret upper bounds instead of lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024), which proved that for any eluder dimension $d_{\textbf{elu}}$ and total variance budget $Λ$, there exists an instance with $\sum_{k=1}^Kσ_k^2\leq Λ$ for which any algorithm incurs a variance-dependent lower bound of $Ω(\sqrt{d_{\textbf{elu}}Λ})$. However, this lower bound has a $\sqrt{d}$ gap with existing upper bounds. Moreover, it only considers a fixed total variance budget $Λ$ and does not apply to a general variance sequence $\{σ_1^2,\ldots,σ_K^2\}$. In this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of $Ω(d \sqrt{\sum_{k=1}^Kσ_k^2 }/\log K)$ for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance $σ_k^2$ in each round $k$ based on historical observations, we show that when the adversary must generate $σ_k^2$ before observing the decision set $\mathcal{D}_k$, a similar lower bound of $Ω(d\sqrt{ \sum_{k=1}^Kσ_k^2} /\log^6(dK))$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al., 2023) up to logarithmic factors.