LG · Oct 21, 2021

Anti-Concentrated Confidence Bonuses for Scalable Exploration

arXiv:2110.11202v2 · 9 citations
AI Analysis

This paper addresses the scalability of exploration algorithms for reinforcement learning practitioners, though the contribution is incremental: it builds on the existing LinUCB method.

The paper tackles the computational inefficiency of the LinUCB algorithm's elliptical bonus in high-dimensional exploration by introducing anti-concentrated confidence bounds, achieving $\tilde O(d \sqrt{T})$ regret bounds for stochastic linear bandits and competitive performance on Atari benchmarks.

Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature, prescribes an elliptical bonus which addresses the challenge of leveraging shared information in large action spaces. This bonus scheme cannot be directly transferred to high-dimensional exploration problems, however, due to the computational cost of maintaining the inverse covariance matrix of action features. We introduce \emph{anti-concentrated confidence bounds} for efficiently approximating the elliptical bonus, using an ensemble of regressors trained to predict random noise from policy network-derived features. Using this approximation, we obtain stochastic linear bandit algorithms which obtain $\tilde O(d \sqrt{T})$ regret bounds for $\mathrm{poly}(d)$ fixed actions. We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
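To make the idea in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It compares the exact elliptical bonus $\sqrt{x^\top \Sigma^{-1} x}$, with $\Sigma = \lambda I + \sum_i x_i x_i^\top$, against an estimate built from an ensemble of ridge regressors fit to pure noise. Several choices are assumptions made for illustration: the function names, Gaussian noise targets, the random prior on each regressor, closed-form ridge solutions (a scalable variant would presumably train the regressors incrementally by gradient descent instead), and the root-mean-square aggregation over the ensemble, which may differ from the estimator used in the paper.

```python
import numpy as np


def elliptical_bonus(X, x, lam=1.0):
    """Exact LinUCB-style bonus sqrt(x^T Sigma^{-1} x), Sigma = lam*I + X^T X."""
    d = X.shape[1]
    cov = lam * np.eye(d) + X.T @ X
    return float(np.sqrt(x @ np.linalg.solve(cov, x)))


def noise_ensemble_bonus(X, x, lam=1.0, n_members=2000, seed=0):
    """Estimate the same bonus from regressors trained to predict random noise.

    Illustrative assumption: each member j minimizes
        ||X w - y_j||^2 + lam * ||w - w0_j||^2
    with y_j ~ N(0, I) and w0_j ~ N(0, I / lam). Under these choices the
    solution w_j has covariance Sigma^{-1}, so x^T w_j has variance
    x^T Sigma^{-1} x, and the spread of the predictions recovers the bonus.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    cov = lam * np.eye(d) + X.T @ X
    Y = rng.standard_normal((n, n_members))                   # noise targets
    W0 = rng.standard_normal((d, n_members)) / np.sqrt(lam)   # random priors
    # Closed-form ridge solution for all ensemble members at once.
    W = np.linalg.solve(cov, X.T @ Y + lam * W0)
    preds = x @ W                                             # one prediction per member
    return float(np.sqrt(np.mean(preds ** 2)))
```

On directions that the observed features cover well, the noise-fit regressors average out and predict close to zero; on poorly covered directions their predictions stay dispersed, which is what lets the ensemble stand in for the elliptical bonus. The closed-form solve is used here only so the estimate can be checked against the exact value.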

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
