LGAIMLFeb 2, 2023

Stochastic Contextual Bandits with Long Horizon Rewards

arXiv:2302.00814v23 citationsh-index: 42
AI Analysis

This work addresses the challenge of long-range temporal dependencies in decision-making and language modeling for researchers in machine learning, though it is incremental as it builds on existing bandit frameworks with new analysis.

The paper tackles the problem of sample-efficient learning in contextual linear bandits with long-horizon rewards, where current rewards depend on up to s prior actions and contexts over a horizon h, and proposes algorithms that avoid polynomial dependence on h by leveraging sparsity. The results include regret upper bounds of ˜O(d√sT + min{q, T}) in data-poor regimes and ˜O(√sdT) in data-rich regimes, with sparsity s, dimension d, and adaptive parameter q.

The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes